WO2023206420A1 - Video encoding and decoding method, device, equipment, system and storage medium - Google Patents

Video encoding and decoding method, device, equipment, system and storage medium

Info

Publication number
WO2023206420A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature information
quantized
predicted
reconstructed
Prior art date
Application number
PCT/CN2022/090468
Other languages
English (en)
French (fr)
Inventor
马展
刘浩杰
Original Assignee
Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Priority to PCT/CN2022/090468
Publication of WO2023206420A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Definitions

  • the present application relates to the technical field of video coding and decoding, and in particular to a video coding and decoding method, device, equipment, system and storage medium.
  • Digital video technology can be incorporated into a variety of video devices, such as digital televisions, smartphones, computers, e-readers, or video players.
  • Video data contains a large amount of data.
  • Therefore, video devices implement video compression technology so that video data can be transmitted or stored more efficiently.
  • Neural network technology has been widely used in video compression technology, for example, in loop filtering, coding block division and coding block prediction.
  • However, current neural-network-based video compression technology still has a poor compression effect.
  • Embodiments of the present application provide a video encoding and decoding method, device, equipment, system and storage medium to improve the video compression effect.
  • this application provides a video decoding method, including:
  • a reconstructed image of the current image is determined.
  • embodiments of the present application provide a video encoding method, including:
  • the quantized first feature information is encoded to obtain the first code stream.
  • the present application provides a video encoder for performing the method in the above first aspect or its respective implementations.
  • the encoder includes a functional unit for performing the method in the above-mentioned first aspect or its respective implementations.
  • the present application provides a video decoder for performing the method in the above second aspect or various implementations thereof.
  • the decoder includes a functional unit for performing the method in the above-mentioned second aspect or its respective implementations.
  • a video encoder including a processor and a memory.
  • the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method in the above first aspect or its respective implementations.
  • a sixth aspect provides a video decoder, including a processor and a memory.
  • the memory is used to store a computer program
  • the processor is used to call and run the computer program stored in the memory to execute the method in the above second aspect or its respective implementations.
  • a seventh aspect provides a video encoding and decoding system, including a video encoder and a video decoder.
  • the video encoder is used to perform the method in the above-mentioned first aspect or its various implementations
  • the video decoder is used to perform the method in the above-mentioned second aspect or its various implementations.
  • An eighth aspect provides a chip for implementing any one of the above-mentioned first to second aspects or the method in each implementation manner thereof.
  • the chip includes: a processor, configured to call and run a computer program from a memory, so that a device installed with the chip executes the method in any one of the above-mentioned first to second aspects or the implementations thereof.
  • a ninth aspect provides a computer-readable storage medium for storing a computer program that causes a computer to execute any one of the above-mentioned first to second aspects or the method in each implementation thereof.
  • a computer program product including computer program instructions, which enable a computer to execute any one of the above-mentioned first to second aspects or the methods in each implementation thereof.
  • An eleventh aspect provides a computer program that, when run on a computer, causes the computer to execute any one of the above-mentioned first to second aspects or the method in each implementation thereof.
  • A twelfth aspect provides a code stream, including a code stream generated by the method in the above-mentioned second aspect or its respective implementations.
  • Based on the above technical solution, this application performs multi-level temporal fusion on the quantized first feature information; that is, the quantized first feature information is fused not only with the feature information of the previous reconstructed image of the current image, but also with the features of multiple reconstructed images before the current image. In this way, when certain information in the previous reconstructed image of the current image is occluded, the occluded information can still be obtained from several reconstructed images before the current image, so that the generated hybrid spatio-temporal representation includes more accurate, rich and detailed feature information.
  • Figure 1 is a schematic block diagram of a video encoding and decoding system related to an embodiment of the present application
  • Figure 2 is a schematic flow chart of a video decoding method provided by an embodiment of the present application.
  • Figure 3 is a schematic network structure diagram of the inverse transformation module involved in the embodiment of the present application.
  • Figure 4 is a schematic network structure diagram of the recursive aggregation module involved in the embodiment of the present application.
  • Figure 5 is a schematic network structure diagram of the first decoder involved in the embodiment of the present application.
  • Figure 6 is a schematic network structure diagram of the second decoder involved in the embodiment of the present application.
  • Figure 7 is a schematic network structure diagram of the third decoder involved in the embodiment of the present application.
  • Figure 8 is a schematic network structure diagram of the fourth decoder involved in the embodiment of the present application.
  • Figure 9 is a schematic network structure diagram of a neural network-based decoder according to an embodiment of the present application.
  • Figure 10 is a schematic diagram of a video decoding process provided by an embodiment of the present application.
  • Figure 11 is a schematic flow chart of a video encoding method provided by an embodiment of the present application.
  • Figure 12 is a schematic network structure diagram of a neural network-based encoder according to an embodiment of the present application.
  • Figure 13 is a schematic diagram of the video encoding process provided by an embodiment of the present application.
  • Figure 14 is a schematic block diagram of a video decoding device provided by an embodiment of the present application.
  • Figure 15 is a schematic block diagram of a video encoding device provided by an embodiment of the present application.
  • Figure 16 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • Figure 17 is a schematic block diagram of a video encoding system provided by an embodiment of the present application.
  • This application can be applied to the fields of image encoding and decoding, video encoding and decoding, hardware video encoding and decoding, dedicated circuit video encoding and decoding, real-time video encoding and decoding, etc.
  • the solution of this application can be operated in conjunction with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including scalable video codec (SVC) and multi-view video codec (MVC) extensions.
  • For ease of understanding, the video encoding and decoding system involved in the embodiments of the present application is first introduced with reference to FIG. 1.
  • Figure 1 is a schematic block diagram of a video encoding and decoding system related to an embodiment of the present application. It should be noted that Figure 1 is only an example, and the video encoding and decoding system in the embodiment of the present application includes but is not limited to what is shown in Figure 1 .
  • the video encoding and decoding system 100 includes an encoding device 110 and a decoding device 120 .
  • the encoding device is used to encode the video data (which can be understood as compression) to generate a code stream, and transmit the code stream to the decoding device.
  • the decoding device decodes the code stream generated by the encoding device to obtain decoded video data.
  • the encoding device 110 in the embodiment of the present application can be understood as a device with a video encoding function
  • the decoding device 120 can be understood as a device with a video decoding function. That is, the encoding device 110 and the decoding device 120 in the embodiment of the present application cover a wide range of devices, including, for example, smartphones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.
  • the encoding device 110 may transmit the encoded video data (eg, code stream) to the decoding device 120 via the channel 130 .
  • Channel 130 may include one or more media and/or devices capable of transmitting encoded video data from encoding device 110 to decoding device 120 .
  • channel 130 includes one or more communication media that enables encoding device 110 to transmit encoded video data directly to decoding device 120 in real time.
  • encoding device 110 may modulate the encoded video data according to the communication standard and transmit the modulated video data to decoding device 120.
  • the communication media includes wireless communication media, such as radio frequency spectrum.
  • the communication media may also include wired communication media, such as one or more physical transmission lines.
  • channel 130 includes a storage medium that can store video data encoded by encoding device 110 .
  • Storage media include a variety of local access data storage media, such as optical disks, DVDs, flash memories, etc.
  • the decoding device 120 may obtain the encoded video data from the storage medium.
  • channel 130 may include a storage server that may store video data encoded by encoding device 110 .
  • the decoding device 120 may download the stored encoded video data from the storage server.
  • the storage server may store the encoded video data and may transmit the encoded video data to the decoding device 120, such as a web server (eg, for a website), a File Transfer Protocol (FTP) server, etc.
  • the encoding device 110 includes a video encoder 112 and an output interface 113.
  • the output interface 113 may include a modulator/demodulator (modem) and/or a transmitter.
  • the encoding device 110 may include a video source 111 in addition to the video encoder 112 and the output interface 113.
  • Video source 111 may include at least one of a video capture device (e.g., a video camera), a video archive, a video input interface for receiving video data from a video content provider, and a computer graphics system used to generate video data.
  • the video encoder 112 encodes the video data from the video source 111 to generate a code stream.
  • Video data may include one or more images (pictures) or sequence of pictures (sequence of pictures).
  • the code stream contains the encoding information of an image or image sequence in the form of a bit stream.
  • Encoded information may include encoded image data and associated data.
  • the associated data may include sequence parameter set (SPS), picture parameter set (PPS) and other syntax structures.
  • An SPS can contain parameters that apply to one or more sequences.
  • a PPS can contain parameters that apply to one or more images.
  • a syntax structure refers to a collection of zero or more syntax elements arranged in a specified order in a code stream.
  • the video encoder 112 transmits the encoded video data directly to the decoding device 120 via the output interface 113 .
  • the encoded video data can also be stored on a storage medium or storage server for subsequent reading by the decoding device 120 .
  • decoding device 120 includes input interface 121 and video decoder 122.
  • the decoding device 120 may also include a display device 123.
  • the input interface 121 includes a receiver and/or a modem. Input interface 121 may receive encoded video data over channel 130.
  • the video decoder 122 is used to decode the encoded video data to obtain decoded video data, and transmit the decoded video data to the display device 123 .
  • the display device 123 displays the decoded video data.
  • Display device 123 may be integrated with decoding device 120 or external to decoding device 120 .
  • Display device 123 may include a variety of display devices, such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or other types of display devices.
  • Figure 1 is only an example, and the technical solution of the embodiment of the present application is not limited to Figure 1.
  • the technology of the present application can also be applied to unilateral video encoding or unilateral video decoding.
  • the above-described video encoder 112 may be applied to image data in a luminance-chrominance (YCbCr, YUV) format.
  • The YUV sampling ratio can be 4:2:0, 4:2:2 or 4:4:4, where Y represents luminance (Luma), Cb (U) represents blue chrominance, Cr (V) represents red chrominance, and U and V together represent chrominance (Chroma), which describes color and saturation.
  • 4:2:0 means that every 4 pixels have 4 luminance components and 2 chrominance components (YYYYCbCr).
  • 4:2:2 means that every 4 pixels have 4 luminance components and 4 chrominance components (YYYYCbCrCbCr).
  • 4:4:4 means that the chrominance components are kept at full resolution (YYYYCbCrCbCrCbCrCbCr).
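  • As an illustration of these sampling formats (not part of the original description), the following Python sketch computes the luma and chroma plane sizes of a frame; the frame dimensions used in the example are hypothetical:

```python
def yuv_plane_shapes(width, height, fmt="4:2:0"):
    """Return (Y, Cb, Cr) plane shapes as (rows, cols) for common chroma sampling formats."""
    sub = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}[fmt]  # (horizontal, vertical) subsampling factors
    cw, ch = width // sub[0], height // sub[1]
    return (height, width), (ch, cw), (ch, cw)

# Example: a 1920x1080 frame in 4:2:0 has a 1080x1920 (rows x cols) Y plane
# and two 540x960 chroma planes.
print(yuv_plane_shapes(1920, 1080, "4:2:0"))
```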
  • the intra-frame prediction method is used in video encoding and decoding technology to eliminate the spatial redundancy between adjacent pixels. Since there is a strong similarity between adjacent frames in the video, the interframe prediction method is used in video coding and decoding technology to eliminate the temporal redundancy between adjacent frames, thereby improving coding efficiency.
  • the embodiments of the present application can be used for inter-frame coding to improve the efficiency of inter-frame coding.
  • Video encoding technology is mainly used for encoding serialized video data and mainly serves data storage, transmission and presentation applications in the Internet era. Video currently accounts for more than 85% of network traffic. As users' demands for the resolution, frame rate and dimensionality of video data continue to increase, the role and value of video encoding technology will also grow significantly, which represents both huge opportunities and challenges for the improvement of video coding technology. Traditional video coding technology has experienced decades of development and transformation, and has greatly satisfied and served the world's video services in every era; it has been iteratively updated under the hybrid coding framework based on multi-scale block levels and is still in use today.
  • Deep learning technology, especially deep neural network technology, has been applied in the field of video coding, initially focusing on the study and replacement of individual sub-technologies of traditional video coding.
  • That is, a corresponding neural network is trained on training data and, after the trained network converges, it is used to replace the corresponding module.
  • the replaceable modules include in-loop filtering, out-of-loop filtering, coding block division, coding block prediction, etc.
  • the current video compression technology based on neural network has poor compression effect.
  • Therefore, this application proposes a purely data-driven neural network coding framework, that is, the entire encoding and decoding system is designed, trained and ultimately used for video encoding based on deep neural networks, and adopts a new hybrid lossy motion representation method to implement neural-network-based inter-frame encoding and decoding.
  • FIG. 2 is a schematic flowchart of a video decoding method provided by an embodiment of the present application.
  • the embodiment of the present application is applied to the video decoder shown in FIG. 1 .
  • the method in the embodiment of this application includes:
  • the first feature information is obtained by feature fusion of the current image and the previous reconstructed image of the current image.
  • An embodiment of the present application proposes a neural network-based decoder, which is obtained through end-to-end training of the neural network-based decoder and the neural network-based encoder.
  • the previous reconstructed image of the current image can be understood as the previous frame image located before the current image in the video sequence, and the previous frame image has been decoded and reconstructed.
  • Since there is a strong similarity between the two adjacent frames, namely the current image and the previous reconstructed image of the current image, the encoding end performs feature fusion on the current image and the previous reconstructed image of the current image during encoding to obtain the first feature information. For example, the encoding end concatenates the current image and the previous reconstructed image of the current image, and performs feature extraction on the concatenated image to obtain the first feature information. For example, the encoding end uses a feature extraction module to extract features from the concatenated image to obtain the first feature information. This application does not limit the specific network structure of the feature extraction module.
  • the first feature information obtained above is of floating point type, for example, represented by a 32-bit floating point number.
  • The encoding end quantizes the first feature information obtained above to obtain the quantized first feature information. Then, the quantized first feature information is encoded to obtain the first code stream. For example, the encoding end performs arithmetic coding on the quantized first feature information to obtain the first code stream. In this way, after the decoding end obtains the first code stream, it decodes the first code stream to obtain the quantized first feature information, and obtains the reconstructed image of the current image based on the quantized first feature information.
  • In S201, the ways in which the decoding end decodes the first code stream and determines the quantized first feature information include but are not limited to the following:
  • Method 1: If the encoding end directly uses the probability distribution of the quantized first feature information to encode the quantized first feature information to obtain the first code stream, then correspondingly, the decoding end directly decodes the first code stream to obtain the quantized first feature information.
  • the above-mentioned quantized first feature information includes a large amount of redundant information.
  • the encoding end performs feature transformation according to the first feature information to obtain the second feature information, quantizes the second feature information and then encodes it to obtain the second code stream;
  • The second code stream is then decoded to obtain the quantized second feature information, the probability distribution of the quantized first feature information is determined based on the quantized second feature information, and the quantized first feature information is then encoded based on this probability distribution to obtain the first code stream.
  • That is, the encoding end determines the super-prior feature information corresponding to the first feature information, namely the second feature information, and determines the probability distribution of the quantized first feature information based on the second feature information. Since the second feature information is the super-prior feature information of the first feature information and contains less redundancy, determining the probability distribution of the quantized first feature information based on this less-redundant second feature information, and using this probability distribution to encode the first feature information, can reduce the encoding cost of the first feature information.
  • the decoder can determine the quantized first feature information through the steps of the following method two.
  • Method 2 The above S201 includes the following steps from S201-A to S201-C:
  • the second feature information is obtained by performing feature transformation on the first feature information.
  • That is, the encoding end performs feature transformation on the first feature information to obtain the super-prior feature information of the first feature information, namely the second feature information, uses the second feature information to determine the probability distribution of the quantized first feature information, and uses this probability distribution to encode the quantized first feature information to obtain the first code stream.
  • the above-mentioned second feature information is encoded to obtain the second code stream. That is to say, in the second method, the encoding end generates two code streams, which are the first code stream and the second code stream.
  • After the decoder obtains the first code stream and the second code stream, it first decodes the second code stream to determine the probability distribution of the quantized first feature information; specifically, it decodes the second code stream to obtain the quantized second feature information, and determines the probability distribution of the quantized first feature information based on the quantized second feature information. Then, the decoding end uses the determined probability distribution to decode the first code stream to obtain the quantized first feature information, thereby achieving accurate decoding of the first feature information.
  • When encoding, the encoding end can directly use the probability distribution of the quantized second feature information to encode the quantized second feature information and obtain the second code stream. Correspondingly, when decoding, the decoding end directly decodes the second code stream to obtain the quantized second feature information.
  • After determining the quantized second feature information according to the above steps, the decoder determines the probability distribution of the quantized first feature information based on the quantized second feature information.
  • This embodiment of the present application does not limit the specific method of determining the probability distribution of the quantized first feature information based on the quantized second feature information in the above S201-B.
  • In some embodiments, since the above-mentioned second feature information is obtained by performing feature transformation on the first feature information, S201-B includes the following steps S201-B1 to S201-B3:
  • the decoder performs inverse transformation on the quantized second feature information to obtain reconstructed feature information, where the inverse transformation method used by the decoder can be understood as the inverse operation of the transformation method used by the encoding end.
  • the encoding end performs N times of feature extraction on the first feature information to obtain the second feature information.
  • Correspondingly, the decoding end performs N times of inverse feature extraction on the quantized second feature information to obtain the inversely transformed feature information, which is recorded as the reconstructed feature information.
  • the embodiment of the present application does not limit the inverse transformation method used by the decoding end.
  • the inverse transformation method used at the decoding end includes N times of feature extraction. That is to say, the decoder performs N times of feature extraction on the obtained quantized second feature information to obtain reconstructed feature information.
  • the inverse transformation method adopted by the decoder includes N times of feature extraction and N times of upsampling. That is to say, the decoder performs N times of feature extraction and N times of upsampling on the obtained quantized second feature information to obtain reconstructed feature information.
  • the embodiments of the present application do not limit the specific execution order of the above-mentioned N times of feature extraction and N times of upsampling.
  • the decoder may first perform N consecutive feature extractions on the quantized second feature information, and then perform N consecutive upsamplings.
  • the above-mentioned N times of feature extraction and N times of upsampling are interspersed, that is, one time of feature extraction is performed and one time of upsampling is performed.
  • For example, the specific process in which the decoder performs inverse transformation on the quantized second feature information to obtain the reconstructed feature information is as follows: the quantized second feature information is input into the first feature extraction module for the first feature extraction to obtain feature information 1; feature information 1 is upsampled to obtain feature information 2; feature information 2 is input into the second feature extraction module for the second feature extraction to obtain feature information 3; and feature information 3 is upsampled to obtain feature information 4, which is recorded as the reconstructed feature information.
  • the embodiments of the present application do not limit the N-times feature extraction methods used by the decoder, which include, for example, at least one of multi-layer convolution, residual connection, dense connection and other feature extraction methods.
  • the decoder performs feature extraction through non-local attention.
  • the above S201-B1 includes the following steps of S201-B11:
  • That is, the decoder uses a non-local attention method to perform feature extraction on the quantized second feature information, so as to achieve fast and accurate feature extraction of the quantized second feature information.
  • When the encoding end generates the second feature information based on the first feature information, it performs N times of down-sampling. Therefore, the decoding end correspondingly performs N times of up-sampling, so that the reconstructed feature information obtained by reconstruction has the same size as the first feature information.
  • the decoder obtains reconstructed feature information through an inverse transformation module, which includes N non-local attention modules and N upsampling modules.
  • the non-local attention module is used to implement non-local attention transformation
  • the up-sampling module is used to implement up-sampling.
  • an upsampling module is connected after a non-local attention module.
  • The decoding end inputs the decoded quantized second feature information into the inverse transformation module. The first non-local attention module in the inverse transformation module performs non-local attention feature extraction on the quantized second feature information to obtain feature information 1, and feature information 1 is input into the first upsampling module for upsampling to obtain feature information 2. Then, feature information 2 is input into the second non-local attention module for non-local attention feature extraction to obtain feature information 3, and feature information 3 is input into the second upsampling module for upsampling to obtain feature information 4.
  • This process continues until the feature information output by the Nth upsampling module is obtained, and that feature information is determined as the reconstructed feature information.
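  • The following PyTorch-style sketch illustrates the interleaved structure described above (a non-local attention module followed by an upsampling module, repeated N times). The internal layers of the attention block, the transposed-convolution upsampling and the channel count are assumptions for illustration; the patent does not specify them here.

```python
import torch
import torch.nn as nn

class NonLocalAttention(nn.Module):
    """Placeholder for the non-local attention module; internal layers are assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1)
        self.phi = nn.Conv2d(channels, channels, 1)
        self.g = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2)                 # B x C x HW
        k = self.phi(x).flatten(2)
        v = self.g(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # B x HW x HW
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + out                               # residual connection

class InverseTransform(nn.Module):
    """N interleaved (non-local attention -> upsampling) stages, as described above."""
    def __init__(self, channels=192, n_stages=2):
        super().__init__()
        stages = []
        for _ in range(n_stages):
            stages += [NonLocalAttention(channels),
                       nn.ConvTranspose2d(channels, channels, 5, stride=2,
                                          padding=2, output_padding=1)]  # doubles resolution
        self.body = nn.Sequential(*stages)

    def forward(self, z_hat):
        return self.body(z_hat)                      # reconstructed feature information
```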
  • the second quantized feature information is obtained by transforming the first feature information.
  • The decoding end performs inverse transformation on the quantized second feature information through the above steps to obtain the reconstructed feature information. Therefore, the reconstructed feature information can be understood as the reconstructed information of the first feature information; that is to say, the probability distribution of the reconstructed feature information is similar or related to the probability distribution of the quantized first feature information.
  • Based on this, the decoder can first determine the probability distribution of the reconstructed feature information, and then predict the probability distribution of the quantized first feature information based on the probability distribution of the reconstructed feature information.
  • the probability distribution of the reconstructed feature information is a normal distribution or a Gaussian distribution.
  • the process of determining the probability distribution of the reconstructed feature information is to determine the probability distribution of the reconstructed feature information based on each feature value in the reconstructed feature information.
  • For example, the mean and variance matrices of the reconstructed feature information are determined, and a Gaussian distribution of the reconstructed feature information is generated based on the mean and variance matrices.
  • S201-B3 Predict the probability distribution of the quantized first feature information based on the probability distribution of the reconstructed feature information.
  • In this way, the embodiment of the present application can use the probability distribution of the reconstructed feature information to achieve accurate prediction of the probability distribution of the quantized first feature information.
  • the probability distribution of the reconstructed feature information is determined as the probability distribution of the quantized first feature information.
  • Alternatively, the probability distribution of the reconstructed feature information is used to predict the probability of each coded point in the quantized first feature information, and the probability distribution of the quantized first feature information is then obtained according to the probabilities of the coded points in the quantized first feature information.
  • S201-C Decode the first code stream according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.
  • the probability distribution is used to decode the first code stream, thereby achieving accurate decoding of the quantized first feature information.
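  • As an illustration of how such a probability model is typically consumed (the patent does not specify the entropy coder), the PyTorch snippet below computes the probability mass that an arithmetic decoder would use for each integer-quantized value, given a mean and scale derived from the reconstructed feature information; all tensor values are hypothetical:

```python
import torch

def symbol_pmf(mean, scale, symbol):
    """Probability mass of an integer-quantized latent under N(mean, scale^2),
    taken as the CDF difference over the unit bin [symbol - 0.5, symbol + 0.5]."""
    dist = torch.distributions.Normal(mean, scale.clamp(min=1e-6))
    return dist.cdf(symbol + 0.5) - dist.cdf(symbol - 0.5)

# Hypothetical example: one mean/scale pair per latent element, predicted from the hyper-prior branch.
mean = torch.tensor([0.3, -1.2])
scale = torch.tensor([1.0, 0.5])
symbols = torch.tensor([0.0, -1.0])
print(symbol_pmf(mean, scale, symbols))  # per-symbol probabilities fed to the arithmetic decoder
```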
  • the decoding end decodes the first code stream according to the above-mentioned method 1 or 2, and after determining the quantized first feature information, performs the following steps of S202.
  • In S202, multi-level temporal fusion is performed on the quantized first feature information; that is, the quantized first feature information is fused not only with the feature information of the previous reconstructed image of the current image, but also with the features of multiple reconstructed images before the current image. For example, the reconstructed images at multiple times such as time t-1, time t-2, ..., time t-k are fused with the quantized first feature information.
  • In this way, when certain information in the previous reconstructed image of the current image is occluded, the occluded information can still be obtained from several reconstructed images before the current image, so that the generated hybrid spatiotemporal representation includes more accurate, rich and detailed feature information.
  • When motion compensation is performed based on this hybrid spatiotemporal representation, the accuracy of the generated predicted images can be improved, and the reconstructed image of the current image can then be accurately obtained based on the accurate predicted images, thereby improving the video compression effect.
  • the embodiments of this application do not limit the specific method by which the decoder performs multi-level time domain fusion on the quantized first feature information to obtain the hybrid spatiotemporal representation.
  • In some embodiments, the decoding end obtains the hybrid spatiotemporal representation through a recursive aggregation module; that is, the above S202 includes the following step S202-A:
  • the decoder uses the recursive aggregation module to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment to obtain a hybrid spatiotemporal representation.
  • Each time it generates a hybrid spatio-temporal representation, the recursive aggregation module of the embodiment of the present application learns and retains the deep-level feature information learned from the input feature information, and uses the learned deep-level features as implicit feature information when generating the next hybrid spatio-temporal representation, thereby improving the accuracy of the generated hybrid spatio-temporal representation. That is to say, in the embodiment of this application, the implicit feature information of the recursive aggregation module at the previous moment includes the feature information of multiple reconstructed images before the current image learned by the recursive aggregation module.
  • Therefore, the decoder uses the recursive aggregation module to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment, which can generate a more accurate, rich and detailed hybrid spatio-temporal representation.
  • the embodiments of this application do not limit the specific network structure of the recursive aggregation module, for example, it can be any network structure that can realize the above functions.
  • In some embodiments, the recursive aggregation module is formed by stacking at least one spatiotemporal recurrent network (ST-LSTM).
  • the expression formula of the above hybrid spatiotemporal representation Gt is as shown in formula (1):
  • h is the implicit feature information included in ST-LSTM.
  • Taking a recursive aggregation module containing two ST-LSTMs as an example, the decoder inputs the decoded quantized first feature information into the recursive aggregation module, and the two ST-LSTMs in the recursive aggregation module process the quantized first feature information in sequence to generate feature information. Specifically, as shown in Figure 4, the implicit feature information h1 generated by the first ST-LSTM is used as the input of the next ST-LSTM. During this operation, the two ST-LSTMs respectively generate the cell-state update values c1 and c2 to update their respective cell states, and the memory information m is transferred between the two ST-LSTMs; finally, the feature information h2 output by the second ST-LSTM is obtained. Furthermore, in order to improve the accuracy of the generated hybrid spatio-temporal representation, the feature information h2 generated by the second ST-LSTM is residually connected with the quantized first feature information, that is, h2 and the quantized first feature information are added to generate the hybrid spatiotemporal representation Gt.
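  • The following PyTorch sketch illustrates this recursive aggregation at a high level: two stacked recurrent cells pass hidden states, cell states and a shared memory, and the output of the second cell is residually added to the quantized first feature information to form Gt. The cell used here is a simplified convolutional recurrent cell standing in for ST-LSTM; its gating details and the channel count are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ConvRecurrentCell(nn.Module):
    """Simplified stand-in for one ST-LSTM cell: updates hidden state h and cell state c
    from the input x and the shared memory m (gating details are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(3 * channels, 3 * channels, 3, padding=1)

    def forward(self, x, h, c, m):
        f, i, g = torch.chunk(self.gates(torch.cat([x, h, m], dim=1)), 3, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)   # cell-state update
        h = torch.tanh(c)                                             # new hidden state
        m = m + h                                                     # memory passed to the next cell
        return h, c, m

class RecursiveAggregation(nn.Module):
    """Two stacked cells; the output h2 is residually added to the quantized
    first feature information to form the hybrid spatio-temporal representation Gt."""
    def __init__(self, channels=192):
        super().__init__()
        self.cell1 = ConvRecurrentCell(channels)
        self.cell2 = ConvRecurrentCell(channels)

    def forward(self, y_hat, state):
        # state holds the hidden/cell states and memory kept from the previous frame
        # (initialized to zeros for the first frame).
        (h1, c1), (h2, c2), m = state
        h1, c1, m = self.cell1(y_hat, h1, c1, m)
        h2, c2, m = self.cell2(h1, h2, c2, m)   # h1 feeds the next cell, m is passed along
        g_t = h2 + y_hat                        # residual connection -> hybrid representation Gt
        return g_t, ((h1, c1), (h2, c2), m)
```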
  • After obtaining the hybrid spatiotemporal representation according to the above method, the decoding end performs the following S203.
  • P is a positive integer.
  • the hybrid spatio-temporal representation in the embodiment of the present application fuses the current image and the feature information of multiple reconstructed images before the current image.
  • Therefore, when the previous reconstructed image is motion-compensated according to the hybrid spatio-temporal representation, accurate P predicted images of the current image can be obtained.
  • the embodiment of the present application does not place a limit on the specific number of P predicted images generated. That is, in the embodiment of this application, the decoder can use different methods to perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation, and obtain P predicted images of the current image.
  • the embodiments of the present application do not limit the specific manner in which the decoder performs motion compensation on the previous reconstructed image based on the mixed spatiotemporal representation.
  • the P predicted images include a first predicted image, which is obtained by the decoder using optical flow motion compensation.
  • The above S203 includes the following steps S203-A1 and S203-A2:
  • the decoder obtains optical flow motion information through a pre-trained neural network model, that is, the neural network model can predict optical flow motion information based on mixed spatiotemporal representation.
  • the neural network model may be called a first decoder, or optical flow signal decoder Df.
  • the decoding end inputs the mixed spatio-temporal representation Gt into the optical flow signal decoder Df to predict the optical flow motion information, and obtains the optical flow motion information f x,y output by the optical flow signal decoder Df.
  • The optical flow motion information f x,y has 2 channels.
  • the optical flow signal decoder Df is composed of multiple NLAMs and multiple upsampling modules.
  • For example, the optical flow signal decoder Df includes 1 NLAM, 3 LAMs and 4 downsampling modules, where a downsampling module is connected after the NLAM, and a downsampling module is connected after each LAM.
  • NLAM includes multiple convolutional layers, for example, 3 convolutional layers, the convolutional kernel size of each convolutional layer is 3*3, and the number of channels is 192.
  • the three LAMs each include multiple convolutional layers.
  • each of the three LAMs includes three convolutional layers.
  • the convolution kernel size of each convolutional layer is 3*3.
  • The numbers of channels of the convolutional layers included in the three LAMs are 128, 96 and 64 respectively.
  • the four down-sampling modules each include a convolution layer Conv.
  • the convolution kernel size of the convolution layer is 5*5.
  • The numbers of channels of the convolution layers included in the four down-sampling modules are 128, 96, 64 and 2 respectively. In this way, the decoder inputs the mixed spatio-temporal representation Gt into the optical flow signal decoder Df.
  • The NLAM performs feature extraction on the spatio-temporal representation Gt to obtain feature information a with 192 channels, and feature information a is input into the first downsampling module for downsampling to obtain feature information b with 128 channels. Then, feature information b is input into the first LAM for further feature extraction to obtain feature information c with 128 channels, and feature information c is input into the second downsampling module for downsampling to obtain feature information d with 96 channels. Next, feature information d is input into the second LAM for further feature extraction to obtain feature information e with 96 channels, and feature information e is input into the third downsampling module for downsampling to obtain feature information f with 64 channels. Feature information f is input into the third LAM for further feature extraction to obtain feature information g with 64 channels. Finally, feature information g is input into the fourth downsampling module for downsampling to obtain feature information j with 2 channels, and this feature information j is the optical flow motion information.
  • After the decoder generates the optical flow motion information f x,y, it uses the optical flow motion information f x,y to perform motion compensation on the previous reconstructed image to obtain the first predicted image X 1.
  • the embodiments of this application do not limit the specific method by which the decoder performs motion compensation on the previous reconstructed image based on the optical flow motion information to obtain the first predicted image.
  • For example, the decoder uses the optical flow motion information f x,y to perform linear interpolation on the previous reconstructed image, and the image generated by the interpolation is recorded as the first predicted image X 1.
  • the decoder obtains the first predicted image X 1 through the following formula (3):
  • That is, the decoder uses the optical flow motion information f x,y to perform motion compensation on the previous reconstructed image through a warping operation to obtain the first predicted image X 1.
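  • A plausible reading of formula (3), inferred from the description above rather than quoted from the patent, is X 1 = Warp(previous reconstructed image, f x,y), where Warp denotes backward bilinear warping. The PyTorch sketch below illustrates such a warping operation; the function name and the bilinear/align_corners choices are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def warp(ref, flow):
    """Backward-warp ref (B, C, H, W) with a 2-channel flow (B, 2, H, W) via bilinear sampling."""
    b, _, h, w = ref.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(ref)   # base sampling grid (x, y)
    coords = grid.unsqueeze(0) + flow                     # displace by the flow field
    # normalize coordinates to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)  # B x H x W x 2
    return F.grid_sample(ref, norm_grid, mode="bilinear", align_corners=True)

# x1 = warp(prev_reconstruction, flow)   # first predicted image X1
```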
  • the P predicted images include a second predicted image, which is obtained by the decoder using offset motion compensation.
  • The above S203 includes the following steps S203-B1 to S203-B3:
  • S203-B3 Use the offset to perform motion compensation on the reference feature information to obtain the second predicted image.
  • the decoder obtains the offset corresponding to the current image through a pre-trained neural network model. That is, the neural network model can predict the offset based on the mixed spatiotemporal representation.
  • That is, the offset is lossy offset information.
  • the neural network model may be called the second decoder, or variable convolutional decoder Dm.
  • the decoding end inputs the mixed spatio-temporal representation Gt into the variable convolution decoder Dm to predict the offset information.
  • the decoder performs spatial feature extraction on the previous reconstructed image to obtain reference feature information.
  • the decoder uses the spatial feature extraction module SFE to extract spatial features from the previous reconstructed image to obtain reference feature information.
  • the decoder uses the offset to perform motion compensation on the extracted reference feature information to obtain a second predicted image of the current image.
  • Embodiments of the present application do not limit the specific manner in which the decoder uses the offset to perform motion compensation on the extracted reference feature information to obtain the second predicted image of the current image.
  • the decoder uses the offset to perform motion compensation based on deformable convolution on the reference feature information to obtain the second predicted image.
  • Specifically, the decoder inputs the mixed spatio-temporal representation Gt and the reference feature information into the transformable convolution; the transformable convolution generates an offset corresponding to the current image based on the mixed spatiotemporal representation Gt, and the offset is applied to the reference feature information for motion compensation, thereby obtaining the second predicted image.
  • For example, the variable convolution decoder Dm in the embodiment of the present application includes a transformable convolution DCN, and the decoding end inputs the previous reconstructed image into the spatial feature extraction module SFE for spatial feature extraction to obtain the reference feature information.
  • the mixed spatio-temporal representation Gt and the reference feature information are input into the transformable convolution DCN for offset extraction and motion compensation to obtain the second predicted image X 2 .
  • the decoder generates the second predicted image X 2 through formula (4):
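  • Based on the description above, formula (4) can be understood as computing X 2 = DCN(SFE(previous reconstructed image), Gt), that is, the transformable convolution DCN applies the offsets derived from Gt to the reference feature information extracted by SFE; this form is inferred from the surrounding text rather than quoted from the patent.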
  • In some embodiments, in addition to the transformable convolution DCN, the variable convolution decoder Dm also includes 1 NLAM, 3 LAMs and 4 downsampling modules, where a downsampling module is connected after the NLAM and a downsampling module is connected after each LAM.
  • The network structure of the 1 NLAM, 3 LAMs and first 3 downsampling modules included in the variable convolution decoder Dm is the same as that of the 1 NLAM, 3 LAMs and first 3 downsampling modules included in the above-mentioned optical flow signal decoder Df, and will not be described again here.
  • the number of channels included in the last downsampling module included in the variable convolution decoder Dm is 5.
  • That is, the decoder first inputs the previous reconstructed image into the spatial feature extraction module SFE for spatial feature extraction to obtain the reference feature information.
  • The mixed spatio-temporal representation Gt and the reference feature information are then input into the transformable convolution DCN in the variable convolution decoder Dm for offset extraction and motion compensation to obtain a piece of feature information, which is input into the NLAM; after feature extraction by the NLAM, the 3 LAMs and the 4 downsampling modules, it is finally restored to the second predicted image X 2.
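  • A minimal sketch of offset-based motion compensation, using torchvision's DeformConv2d as a stand-in for the transformable convolution DCN; the offset-prediction convolution, the channel counts and the assumption that Gt and the reference features share the same spatial size are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class OffsetMotionCompensation(nn.Module):
    """Predict per-position sampling offsets from the hybrid representation Gt,
    then apply a deformable convolution to the reference feature information."""
    def __init__(self, feat_channels=64, ctx_channels=192, kernel_size=3):
        super().__init__()
        # 2 offset values (dx, dy) per kernel tap
        self.offset_pred = nn.Conv2d(ctx_channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.dcn = DeformConv2d(feat_channels, feat_channels, kernel_size,
                                padding=kernel_size // 2)

    def forward(self, ref_feat, g_t):
        # assumes g_t and ref_feat have the same spatial resolution
        offset = self.offset_pred(g_t)          # lossy offset information
        return self.dcn(ref_feat, offset)       # motion-compensated features, later restored to X2
```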
  • the decoder can determine P predicted images, for example, determine the first predicted image and the second predicted image, and then perform the following steps of S204.
  • the reconstructed image of the current image is determined based on the predicted image.
  • For example, the predicted image is compared with one or several previous reconstructed images of the current image and a loss is calculated. If the loss is small, it means that the prediction accuracy of the predicted image is high, and the predicted image can be determined as the reconstructed image of the current image.
  • If the loss is large, the reconstructed image of the current image can be determined based on the previous one or several reconstructed images of the current image and the predicted image.
  • For example, the predicted image and one or several previous reconstructed images of the current image are input into a neural network to obtain the reconstructed image of the current image.
  • the above S204 includes the following steps of S204-A and S204-B:
  • the decoder first determines the target predicted image of the current image based on P predicted images, and then implements the reconstructed image of the current image based on the target predicted image of the current image, thereby improving the accuracy of determining the reconstructed image.
  • the embodiment of the present application does not limit the specific method of determining the target predicted image of the current image based on the P predicted images.
  • For example, if the P predicted images include only one predicted image, that one predicted image is determined as the target predicted image of the current image.
  • S204-A includes S204-A11 and S204-A12:
  • The P predicted images are weighted to generate a weighted image, and the target predicted image is then obtained according to the weighted image.
  • the embodiment of the present application does not limit the specific method of determining the weighted image based on the P predicted images.
  • the weights corresponding to P predicted images are determined; and the P predicted images are weighted according to the weights corresponding to the P predicted images to obtain weighted images.
  • the decoder determines the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image, and based on the first weight and the The second weight is used to weight the first predicted image and the second predicted image to obtain a weighted image.
  • the methods for determining the weights corresponding to the P predicted images include but are not limited to the following:
  • Method 2: The decoder performs adaptive masking based on the mixed spatiotemporal representation to obtain the weights corresponding to the P predicted images.
  • the decoder uses a neural network model to generate weights corresponding to P predicted images.
  • the neural network model is pre-trained and can be used to generate weights corresponding to P predicted images.
  • this neural network model is also called the third decoder or adaptive mask compensation decoder Dw .
  • the decoding end inputs the mixed spatio-temporal representation into the adaptive mask compensation decoder Dw to perform adaptive masking, and obtains the weights corresponding to the P predicted images.
  • Specifically, the decoding end inputs the mixed spatio-temporal representation Gt into the adaptive mask compensation decoder D w for adaptive masking, and the adaptive mask compensation decoder D w outputs the first weight w1 of the first predicted image and the second weight w2 of the second predicted image. Based on the first weight w1 and the second weight w2, the first predicted image X 1 and the second predicted image X 2 are weighted, so that the information representing different areas in the predicted frame can be adaptively selected, and a weighted image is then generated.
  • the weighted image X 3 is generated according to the following formula (5):
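  • Based on the description of the weights above, formula (5) can be understood as computing the weighted image as X 3 = w1 · X 1 + w2 · X 2 , with the weights applied element-wise; this form is inferred from the surrounding text rather than quoted from the patent.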
  • In some embodiments, the weight corresponding to each of the P predicted images is a matrix that includes a weight for each pixel in that predicted image. In this way, when generating the weighted image, for each pixel in the current image, the predicted values and weights corresponding to that pixel in the P predicted images are weighted to obtain the weighted predicted value of the pixel, and the weighted predicted values corresponding to all pixels in the current image constitute the weighted image of the current image.
  • the embodiment of the present application does not limit the specific network structure of the above-mentioned adaptive mask compensation decoder D w .
  • the adaptive mask compensation decoder Dw includes 1 NLAM, 3 LAMs, 4 downsampling modules and a sigmoid function, where one NLAM is followed by a downsampling module, A downsampling module is connected after a LAM.
  • The network structure of the 1 NLAM, 3 LAMs and 4 downsampling modules included in the adaptive mask compensation decoder Dw is the same as that of the 1 NLAM, 3 LAMs and 4 downsampling modules included in the above-mentioned variable convolution decoder Dm, and will not be described again here.
  • the decoder weights the P predicted images according to the above method, and after obtaining the weighted images, performs the following S204-A12.
  • the weighted image is determined as the target prediction image.
  • the decoder can also obtain the residual image of the current image based on the mixed spatiotemporal representation.
  • the decoder uses a neural network model to obtain the residual image of the current image.
  • the neural network model is pre-trained and can be used to generate the residual image of the current image.
  • this neural network model is also called the fourth decoder or spatial texture enhancement decoder Dt.
  • This residual image X r can be used to perform texture enhancement on the predicted image.
  • In some embodiments, the spatial texture enhancement decoder Dt includes 1 NLAM, 3 LAMs, and 4 downsampling modules, where a downsampling module is connected after the NLAM, and a downsampling module is connected after each LAM.
  • The network structure of the 1 NLAM, 3 LAMs and first 3 downsampling modules included in the spatial texture enhancement decoder Dt is the same as that of the 1 NLAM, 3 LAMs and first 3 downsampling modules included in the above-mentioned optical flow signal decoder Df, and will not be described again here.
  • The number of channels of the convolution layer included in the last downsampling module of the spatial texture enhancement decoder Dt is 3.
  • determining the target predicted image of the current image based on the P predicted images in S204-A above includes the following steps of S204-A21:
  • a target predicted image is obtained based on the predicted image and the residual image. For example, the predicted image and the residual image are added to generate the target predicted image.
  • If P is greater than 1, the weighted image is first determined based on the P predicted images, and the target predicted image is then determined based on the weighted image and the residual image.
  • the specific process of determining the weighted image by the decoding end based on the P predicted images can refer to the specific description of S204-A11 above, which will not be described again here.
  • For example, the first weight w1 corresponding to the first predicted image and the second weight w2 corresponding to the second predicted image are determined.
  • Based on the first weight w1 and the second weight w2, the first predicted image and the second predicted image are weighted to obtain a weighted image X 3 , and the residual image X r is then used to enhance the weighted image X 3 to obtain the target predicted image.
  • the target prediction image X 4 is generated according to the following formula (6):
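  • Based on the description above, formula (6) can be understood as computing the target predicted image as X 4 = X 3 + X r , that is, the residual image X r is added to the weighted image X 3 ; this form is inferred from the surrounding text rather than quoted from the patent.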
  • After the decoder determines the target predicted image of the current image, it performs the following S204-B.
  • For example, the target predicted image is compared with one or several previous reconstructed images of the current image, and a loss is calculated. If the loss is small, it means that the prediction accuracy of the target predicted image is high, and the target predicted image can be determined as the reconstructed image of the current image. If the loss is large, it means that the prediction accuracy of the target predicted image is low.
  • In that case, the reconstructed image of the current image can be determined based on the previous one or several reconstructed images of the current image and the target predicted image; for example, the target predicted image and one or several previous reconstructed images of the current image are input into a neural network to obtain the reconstructed image of the current image.
  • the embodiments of the present application also include residual decoding.
  • the above-mentioned S204-B includes the following steps of S204-B1 and S204-B2:
  • In some embodiments, in order to improve the effect of the reconstructed image, the encoding end also generates a residual code stream through residual coding. Specifically, the encoding end determines the residual value of the current image and encodes the residual value to generate the residual code stream. Correspondingly, the decoder decodes the residual code stream to obtain the residual value of the current image, and obtains the reconstructed image based on the target predicted image and the residual value.
  • the embodiment of the present application does not limit the specific expression form of the residual value of the above-mentioned current image.
  • the residual value of the current image is a matrix, and each element in the matrix is the residual value corresponding to each pixel in the current image.
  • the decoder can add the residual value and prediction value corresponding to each pixel in the target prediction image pixel by pixel to obtain the reconstructed value of each pixel, and then obtain the reconstructed image of the current image.
• for the i-th pixel, the predicted value corresponding to the i-th pixel is obtained from the target predicted image, and the residual value corresponding to the i-th pixel is obtained from the residual value of the current image; the two are added to obtain the reconstruction value of the i-th pixel.
• in this way, the reconstruction value corresponding to each pixel in the current image can be obtained, and the reconstruction values corresponding to all pixels in the current image form the reconstructed image of the current image.
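• purely as an illustration of this pixel-wise addition (a minimal sketch, not taken from the patent itself), the reconstruction step could look as follows in PyTorch; the clamping range assumes pixel values normalized to [0, 1].

    import torch

    def reconstruct(target_pred, residual):
        # target_pred, residual: tensors of shape (B, 3, H, W)
        recon = target_pred + residual      # pixel-by-pixel addition of prediction and residual
        return recon.clamp(0.0, 1.0)        # assumed normalized pixel range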
  • the embodiments of this application do not limit the specific way in which the decoding end obtains the residual value of the current image. That is to say, the embodiments of this application do not limit the residual encoding and decoding methods used by both encoding and decoding ends.
• the encoding end determines the target predicted image of the current image in the same manner as the decoding end, and then obtains the residual value of the current image based on the current image and the target predicted image. For example, the difference between the current image and the target predicted image is determined as the residual value of the current image.
• the residual value of the current image is then encoded to generate the residual code stream.
  • the residual value of the current image can be transformed to obtain the transformation coefficient, the transformation coefficient can be quantized to obtain the quantized coefficient, and the quantized coefficient can be encoded to obtain the residual code stream.
  • the decoding end decodes the residual code stream to obtain the residual value of the current image.
• for example, the decoding end decodes the residual code stream to obtain the quantization coefficient, and performs inverse quantization and inverse transformation on the quantization coefficient to obtain the residual value of the current image. Then, according to the above method, the target predicted image and the residual value of the current image are added to obtain the reconstructed image of the current image.
  • the encoding end may use a neural network method to process the current image and the target predicted image of the current image, generate a residual value of the current image, encode the residual value of the current image, and generate a residual code stream.
• the decoder decodes the residual code stream to obtain the residual value of the current image. Then, according to the above method, the target predicted image and the residual value of the current image are added to obtain the reconstructed image of the current image.
  • the decoding end can obtain the reconstructed image of the current image according to the above method.
  • the reconstructed image can be displayed directly.
  • the reconstructed image can also be stored in a cache for subsequent image decoding.
  • the decoding end determines the quantized first feature information by decoding the first code stream.
• the first feature information is obtained by feature fusion of the current image and the previous reconstructed image of the current image; multi-level time domain fusion is performed on the quantized first feature information to obtain a mixed spatiotemporal representation; motion compensation is performed on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image, where P is a positive integer; and the reconstructed image of the current image is determined according to the P predicted images.
• the quantized first feature information is not only fused with the feature information of the previous reconstructed image of the current image, but is also feature-fused with multiple reconstructed images before the current image. In this way, when certain information in the previous reconstructed image of the current image is occluded, the occluded information can still be obtained from earlier reconstructed images, so that the generated hybrid spatio-temporal representation includes more accurate, rich and detailed feature information.
  • an end-to-end neural network-based encoding and decoding framework is proposed.
  • the neural network-based encoding and decoding framework includes a neural network-based encoder and a neural network-based decoder.
  • the decoding process of the embodiment of the present application is introduced below in conjunction with a possible neural network-based decoder of the present application.
  • Figure 9 is a schematic network structure diagram of a neural network-based decoder related to an embodiment of the present application, including: an inverse transformation module, a recursive aggregation module and a hybrid motion compensation module.
  • the inverse transformation module is used to perform inverse transformation on the quantized second feature information to obtain reconstructed feature information of the first feature information.
  • its network structure is shown in Figure 3.
  • the recursive aggregation module is used to perform multi-level time domain fusion on the quantized first feature information to obtain a hybrid spatio-temporal representation.
  • its network structure is shown in Figure 4.
  • the hybrid motion compensation module is used to perform hybrid motion compensation on the mixed spatio-temporal representation to obtain the target predicted image of the current image.
• the hybrid motion compensation module may include the first decoder shown in Figure 5 and/or the second decoder shown in Figure 6. Optionally, if the hybrid motion compensation module includes both the first decoder and the second decoder, it may also include the third decoder shown in Figure 7. In some embodiments, the hybrid motion compensation module may further include the fourth decoder shown in Figure 8.
  • the embodiment of the present application takes the motion compensation module including a first decoder, a second decoder, a third decoder, and a fourth decoder as an example for description.
  • Figure 10 is a schematic diagram of the video decoding process provided by an embodiment of the present application. As shown in Figure 10, it includes:
  • the specific network structure of the inverse transformation module is shown in Figure 3, including 2 non-local self-attention modules and 2 upsampling modules.
  • the decoding end inputs the quantized second feature information into an inverse transformation module for inverse transformation, and the inverse transformation module outputs reconstructed feature information.
  • S305 Decode the first code stream according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.
  • the recursive aggregation module is stacked by at least one spatiotemporal recursive network.
  • the decoding end inputs the quantized first feature information into the recursive aggregation module, so that the recursive aggregation module fuses the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment, and then outputs a mixed spatiotemporal representation.
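• as a hedged sketch of how such recursive temporal fusion can be realized, a plain convolutional LSTM cell is shown below as a stand-in for the ST-LSTM used by the recursive aggregation module; the channel sizes and the cell design are assumptions, not the patent's exact network.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        """Fuses the current quantized features with the hidden state carried
        over from previous frames (the 'implicit feature information')."""
        def __init__(self, in_ch, hid_ch):
            super().__init__()
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

        def forward(self, x, state):
            h_prev, c_prev = state   # implicit feature information of the previous moment
            i, f, o, g = self.gates(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
            c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)   # plays the role of the mixed spatio-temporal representation
            return h, (h, c)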
  • S307 Process the mixed spatiotemporal representation through the first decoder to obtain the first predicted image.
• the mixed spatio-temporal representation and the previous reconstructed image are input into the hybrid motion compensation module for hybrid motion compensation to obtain the target predicted image of the current image.
  • the mixed spatio-temporal representation is processed by the first decoder to determine the optical flow motion information, and motion compensation is performed on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
  • the network structure of the first decoder is shown in Figure 5.
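• the warping step of this optical-flow-based motion compensation can be sketched as follows (an illustrative PyTorch example using grid_sample, not the patent's decoder network; the flow is assumed to be in pixel units with channel 0 holding horizontal offsets).

    import torch
    import torch.nn.functional as F

    def warp(prev_recon, flow):
        # prev_recon: (B, 3, H, W); flow: (B, 2, H, W) pixel offsets (x, y)
        b, _, h, w = prev_recon.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(prev_recon.device)   # (2, H, W) absolute grid
        coords = base.unsqueeze(0) + flow                                   # sampling positions
        # normalize to [-1, 1] as required by grid_sample
        gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
        gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                                # (B, H, W, 2)
        return F.grid_sample(prev_recon, grid, mode="bilinear", align_corners=True)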
  • S308 Process the mixed spatiotemporal representation through the second decoder to obtain a second predicted image.
• SFE is used to extract spatial features from the previous reconstructed image to obtain the reference feature information; the reference feature information and the mixed spatio-temporal representation are input into the second decoder, so that offset-based motion compensation is performed on the reference feature information to obtain the second predicted image.
  • the network structure of the second decoder is shown in Figure 6.
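• offset-based compensation of the reference features can be illustrated with a deformable convolution, as in the hedged sketch below (torchvision's DeformConv2d is used as a generic stand-in; the channel sizes and the offset-prediction layer are assumptions rather than the patent's exact design).

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class OffsetCompensation(nn.Module):
        def __init__(self, feat_ch=64, ctx_ch=64, k=3):
            super().__init__()
            # predicts 2*k*k sampling offsets per position from the mixed spatio-temporal representation
            self.offset_pred = nn.Conv2d(ctx_ch, 2 * k * k, kernel_size=3, padding=1)
            self.dcn = DeformConv2d(feat_ch, feat_ch, kernel_size=k, padding=k // 2)

        def forward(self, ref_feat, mixed_repr):
            offsets = self.offset_pred(mixed_repr)   # (B, 2*k*k, H, W)
            return self.dcn(ref_feat, offsets)       # motion-compensated reference features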
  • the mixed spatio-temporal representation is input to the third decoder for adaptive masking to obtain the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image.
  • the network structure of the third decoder is shown in Figure 7.
  • the product of the first weight and the first predicted image is added to the product of the second weight and the second predicted image to obtain a weighted image.
  • the mixed spatiotemporal representation is input to the fourth decoder for processing to obtain the residual image of the current image.
  • the weighted image and the residual image are added together to determine the target prediction image.
• in the embodiment of the present application, multi-level time domain fusion is performed on the quantized first feature information, that is, the quantized first feature information is feature-fused with multiple reconstructed images before the current image, so that the generated hybrid spatio-temporal representation includes more accurate, rich and detailed feature information.
• based on the hybrid spatio-temporal representation, motion compensation is performed on the previous reconstructed image to generate multiple pieces of decoding information, including the first predicted image, the second predicted image, the weights corresponding to the first predicted image and the second predicted image, and the residual image. When the target predicted image of the current image is determined based on this decoding information, the accuracy of the target predicted image can be effectively improved; the reconstructed image of the current image can then be accurately obtained based on the accurate predicted image, thereby improving the video compression effect.
  • the video decoding method involved in the embodiment of the present application is described above. On this basis, the video encoding method involved in the present application is described below with respect to the encoding end.
  • FIG 11 is a schematic flowchart of a video encoding method provided by an embodiment of the present application.
  • the execution subject of the embodiment of the present application may be the encoder shown in Figure 1 above.
  • the method in the embodiment of this application includes:
  • the embodiment of the present application proposes an encoder based on a neural network, which is obtained through end-to-end training of the encoder based on the neural network and the decoder based on the neural network.
  • the previous reconstructed image of the current image can be understood as the previous frame image located before the current image in the video sequence, and the previous frame image has been decoded and reconstructed.
• when encoding, the encoding end performs feature fusion on the current image X t and the previous reconstructed image of the current image to obtain the first feature information. For example, the encoding end cascades the current image X t and the previous reconstructed image of the current image along the channel dimension to obtain the cascaded input data X cat , where the previous reconstructed image and X t are both 3-channel video frame inputs in the sRGB domain. Then, feature extraction is performed on the concatenated image X cat to obtain the first feature information.
  • the embodiments of this application do not limit the specific manner in which the encoding end performs feature extraction on X cat .
  • it includes at least one of feature extraction methods such as multi-layer convolution, residual connection, and dense connection.
  • the encoding end performs Q times of non-local attention transformation and Q times of downsampling on the concatenated image to obtain the first feature information, where Q is a positive integer.
  • the encoding end inputs the cascaded 6-channel high-dimensional input signal X cat into a spatiotemporal feature extraction module (Spatiotemporal Feature Extraction, STFE) for multi-layer feature transformation and extraction.
  • the spatiotemporal feature extraction module includes Q non-local attention modules and Q downsampling modules.
  • the non-local attention module is used to implement non-local attention transformation
  • the down-sampling module is used to implement down-sampling.
  • a downsampling module is connected after a non-local attention module.
  • the encoding end inputs the cascaded 6-channel high-dimensional input signal X cat into STFE.
• the first non-local attention module in STFE performs non-local attention feature transformation and extraction on X cat to obtain feature information 11, and the feature information 11 is input into the first downsampling module for downsampling to obtain feature information 12.
• the feature information 12 is input into the second non-local attention module for non-local attention feature transformation and extraction to obtain feature information 13, and the feature information 13 is then input into the second downsampling module for downsampling to obtain feature information 14.
• by analogy, the feature information output by the Q-th downsampling module is obtained and determined as the first feature information X F .
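• a hedged, simplified sketch of this cascade-then-extract step is given below in PyTorch; plain convolutional blocks stand in for the NLAM and downsampling modules, and the channel counts are assumptions.

    import torch
    import torch.nn as nn

    class SimpleSTFE(nn.Module):
        """Toy stand-in for spatiotemporal feature extraction: concatenate X_t and
        the previous reconstruction on the channel axis, then apply Q stages of
        (feature transform -> stride-2 downsampling)."""
        def __init__(self, q=2, ch=64):
            super().__init__()
            layers = [nn.Conv2d(6, ch, 3, padding=1)]              # 3 + 3 = 6 input channels
            for _ in range(q):
                layers += [nn.ReLU(inplace=True),
                           nn.Conv2d(ch, ch, 3, padding=1),            # stand-in for an NLAM
                           nn.Conv2d(ch, ch, 3, stride=2, padding=1)]  # downsampling module
            self.net = nn.Sequential(*layers)

        def forward(self, x_t, prev_recon):
            x_cat = torch.cat([x_t, prev_recon], dim=1)   # cascaded 6-channel input X_cat
            return self.net(x_cat)                        # first feature information X_F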
• the first feature information obtained above is of floating point type, for example, represented by a 32-bit floating point number. Furthermore, in order to reduce the encoding cost, the encoding end quantizes the first feature information obtained above to obtain the quantized first feature information.
  • the encoding end uses the rounding function Round(.) to quantize the first feature information.
• the first feature information X F is quantized using the method shown in the following formula (7): X̂ F = X F + U(-0.5, 0.5)      (7)
• where U(-0.5, 0.5) is a uniform noise distribution over [-0.5, 0.5], which is used to approximate the actual rounding quantization function Round(.).
• differentiating formula (7) yields a gradient of 1, which is used as the backpropagation gradient to update the model.
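• assuming the additive-uniform-noise reading of formula (7) above, a minimal sketch of the quantization step could look like this (training uses noise as a differentiable proxy, encoding uses actual rounding):

    import torch

    def quantize(x_f, training):
        if training:
            # add uniform noise U(-0.5, 0.5) as a differentiable proxy for Round(.)
            noise = torch.empty_like(x_f).uniform_(-0.5, 0.5)
            return x_f + noise
        return torch.round(x_f)   # actual rounding quantization when encoding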
  • Method 1 The encoding end directly uses the probability distribution of the quantized first feature information to encode the quantized first feature information to obtain the first code stream.
  • the above-mentioned quantized first feature information includes a large amount of redundant information.
  • the encoding end performs feature transformation according to the first feature information to obtain the second feature information, quantizes the second feature information and then encodes it to obtain the second code stream;
• the second code stream is decoded to obtain the quantized second feature information, and the probability distribution of the quantized first feature information is determined based on the quantized second feature information; then, based on the probability distribution of the quantized first feature information, the quantized first feature information is encoded to obtain the first code stream.
• in Method 2, the encoding end determines the super-prior feature information corresponding to the first feature information, that is, the second feature information, and determines the probability distribution of the quantized first feature information based on the second feature information. Since the second feature information is the super-prior feature information of the first feature information and contains less redundancy, determining the probability distribution of the quantized first feature information from this less-redundant information and using it to encode the quantized first feature information can reduce the encoding cost of the first feature information.
  • the encoding end can encode the quantized first feature information through the steps of the following method 2 to obtain the first code stream.
  • Method 2 The above S403 includes the following steps S403-A1 to S403-A4:
• the encoding end performs feature transformation on the first feature information to obtain the super-prior feature information of the first feature information, that is, the second feature information; it uses the second feature information to determine the probability distribution of the quantized first feature information, and uses this probability distribution to encode the quantized first feature information to obtain the first code stream.
  • the above-mentioned second feature information is encoded to obtain the second code stream. That is to say, in the second method, the encoding end generates two code streams, which are the first code stream and the second code stream.
• the methods by which the encoding end performs feature transformation according to the first feature information to obtain the second feature information include but are not limited to the following:
  • Method 1 Perform N times of non-local attention transformation and N times of downsampling on the first feature information to obtain the second feature information.
  • Method 2 Perform N times of non-local attention transformation and N times of downsampling on the quantized first feature information to obtain the second feature information.
  • the encoding end can perform N times of non-local attention transformation and N times of downsampling on the first feature information or the quantized first feature information to obtain the second feature information.
• the second feature information is quantized to obtain the quantized second feature information; the probability distribution of the quantized second feature information is determined; and the quantized second feature information is encoded according to the probability distribution of the quantized second feature information to obtain the second code stream.
• when encoding, the encoding end directly uses the probability distribution of the quantized second feature information to encode the quantized second feature information to obtain the second code stream.
  • S403-A3 Decode the second code stream to obtain the quantized second feature information, and determine the probability distribution of the quantized first feature information based on the quantized second feature information.
• specifically, the encoding end performs arithmetic decoding on the super-prior second code stream to restore the quantized super-prior spatiotemporal features, that is, the quantized second feature information; it then determines the probability distribution of the quantized first feature information based on the quantized second feature information, and encodes the quantized first feature information according to this probability distribution to obtain the first code stream.
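• the following is a hedged, high-level sketch of such a hyperprior-style coding flow; h_a, h_s and arithmetic_coder are assumed placeholder components (hyper analysis/synthesis networks and an entropy coder), not names taken from the patent.

    import torch

    def encode_with_hyperprior(x_f, h_a, h_s, arithmetic_coder):
        z = h_a(x_f)                                  # second feature information (super-prior)
        z_hat = torch.round(z)                        # quantized second feature information
        stream2 = arithmetic_coder.encode(z_hat)      # second code stream
        mean, scale = h_s(z_hat).chunk(2, dim=1)      # probability model for the quantized features
        x_f_hat = torch.round(x_f)                    # quantized first feature information
        stream1 = arithmetic_coder.encode(x_f_hat, mean=mean, scale=scale)   # first code stream
        return stream1, stream2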
  • determining the probability distribution of the quantized first feature information includes the following steps:
  • the encoding end performs inverse transformation on the quantized second feature information to obtain reconstructed feature information, where the inverse transformation method used by the encoding end can be understood as the inverse operation of the transformation method used by the encoding end.
• for example, if the encoding end performs N times of feature extraction on the first feature information to obtain the second feature information, the encoding end then performs N times of inverse feature extraction on the quantized second feature information, and the inversely transformed feature information is recorded as the reconstructed feature information.
  • the embodiment of the present application does not limit the inverse transformation method adopted by the encoding end.
  • the inverse transformation method used by the encoding end includes N times of feature extraction. That is to say, the encoding end performs N times of feature extraction on the obtained quantized second feature information to obtain reconstructed feature information.
  • the inverse transformation method adopted by the encoding end includes N times of feature extraction and N times of upsampling. That is to say, the encoding end performs N times of feature extraction and N times of upsampling on the obtained quantized second feature information to obtain reconstructed feature information.
  • the embodiments of the present application do not limit the specific execution order of the above-mentioned N times of feature extraction and N times of upsampling.
  • the encoding end may first perform N consecutive feature extractions on the quantized second feature information, and then perform N consecutive upsamplings.
  • the above-mentioned N times of feature extraction and N times of upsampling are interspersed, that is, one time of feature extraction is performed and one time of upsampling is performed.
  • the embodiments of the present application do not limit the N-times feature extraction methods used by the encoding end, which include, for example, at least one of feature extraction methods such as multi-layer convolution, residual connection, and dense connection.
  • the encoding end performs N times of non-local attention transformation and N times of upsampling on the quantized second feature information to obtain reconstructed feature information, where N is a positive integer.
• the encoding end uses the non-local attention method to perform feature extraction on the quantized second feature information, so as to achieve fast and accurate feature extraction of the quantized second feature information.
• when the encoding end generates the second feature information based on the first feature information, it performs N times of downsampling; therefore, the encoding end performs N times of upsampling during the inverse transformation, so that the size of the reconstructed feature information is consistent with that of the first feature information.
  • the encoding end obtains reconstructed feature information through an inverse transformation module, which includes N non-local attention modules and N upsampling modules.
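• for illustration, a generic non-local (self-attention) block is sketched below in PyTorch; the embedding sizes are assumptions and the patent's NLAM may differ in detail.

    import torch
    import torch.nn as nn

    class NonLocalBlock(nn.Module):
        """Generic non-local attention over spatial positions with a residual connection."""
        def __init__(self, ch, inner=None):
            super().__init__()
            inner = inner or max(ch // 2, 1)
            self.theta = nn.Conv2d(ch, inner, 1)
            self.phi = nn.Conv2d(ch, inner, 1)
            self.g = nn.Conv2d(ch, inner, 1)
            self.out = nn.Conv2d(inner, ch, 1)

        def forward(self, x):
            b, c, h, w = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, inner)
            k = self.phi(x).flatten(2)                     # (B, inner, HW)
            v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, inner)
            attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities (B, HW, HW)
            y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
            return x + self.out(y)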
• since the quantized second feature information is obtained by transforming the first feature information, and the encoding end performs the inverse transformation on the quantized second feature information through the above steps to obtain the reconstructed feature information, the reconstructed feature information can be understood as reconstructed information of the first feature information; that is to say, the probability distribution of the reconstructed feature information is similar or related to the probability distribution of the quantized first feature information. In this way, the encoding end can first determine the probability distribution of the reconstructed feature information, and then predict the probability distribution of the quantized first feature information based on it.
  • the probability distribution of the reconstructed feature information is a normal distribution or a Gaussian distribution.
• for example, the process of determining the probability distribution of the reconstructed feature information is as follows: the mean and variance matrices are determined based on each feature value in the reconstructed feature information, and a Gaussian distribution of the reconstructed feature information is generated based on the mean and variance matrices.
  • S403-A33 Determine the probability distribution of the quantized first feature information based on the probability distribution of the reconstructed feature information.
• for example, based on the probability distribution of the reconstructed feature information, the probability of the pixels to be coded in the quantized first feature information is predicted; based on the probability of these pixels, the probability distribution of the quantized first feature information is obtained.
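• purely as a hedged illustration of turning a per-element Gaussian (mean/variance) model into coding probabilities for integer-quantized values, the probability mass of each quantization bin can be evaluated as below; this is a common construction and not necessarily the patent's exact procedure.

    import torch

    def gaussian_bin_probability(x_hat, mean, scale, eps=1e-9):
        # probability of each quantized value under N(mean, scale^2),
        # integrated over its quantization bin [x_hat - 0.5, x_hat + 0.5]
        dist = torch.distributions.Normal(mean, scale.clamp(min=1e-6))
        prob = dist.cdf(x_hat + 0.5) - dist.cdf(x_hat - 0.5)
        return prob.clamp(min=eps)    # lower bound keeps -log2(prob) finite for the entropy coder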
  • S403-A4 Encode the quantized first feature information according to the probability distribution of the quantized first feature information to obtain the first code stream.
  • the probability distribution is used to encode the quantized first feature information to obtain the first code stream.
  • the embodiment of the present application also includes the step of determining the reconstructed image of the current image, that is, the embodiment of the present application also includes the following S404:
  • the above S404 includes the following steps:
  • the above-mentioned quantized first feature information is feature information obtained by quantizing the first feature information at the encoding end.
• alternatively, the above-mentioned quantized first feature information is reconstructed by the encoding end through decoding.
• for example, the encoding end decodes the second code stream to obtain the quantized second feature information, and determines the probability distribution of the quantized first feature information based on the quantized second feature information.
• specifically, the encoding end obtains the probability distribution of the quantized first feature information according to the method of S403-A31 to S403-A33 above, and then decodes the first code stream using the probability distribution of the quantized first feature information to obtain the quantized first feature information.
  • the encoding end performs multi-level time domain fusion on the quantized first feature information obtained above to obtain a hybrid spatiotemporal representation.
• in the embodiment of the present application, multi-level time domain fusion is performed on the quantized first feature information; that is, the quantized first feature information is not only fused with the feature information of the previous reconstructed image of the current image, but is also feature-fused with multiple reconstructed images before the current image. For example, the reconstructed images at times t-1, t-2, ..., t-k are fused with the quantized first feature information.
• in this way, when certain information in the previous reconstructed image of the current image is occluded, the occluded information can still be obtained from several reconstructed images before the current image, so that the generated hybrid spatiotemporal representation includes more accurate, rich and detailed feature information.
• when motion compensation is subsequently performed based on this hybrid spatiotemporal representation, the accuracy of the generated predicted images can be improved, and the reconstructed image of the current image can then be accurately obtained based on the accurate predicted images, thereby improving the video compression effect.
  • the embodiments of this application do not limit the specific method by which the encoding end performs multi-level time domain fusion on the quantized first feature information to obtain the hybrid spatiotemporal representation.
• the encoding end obtains the mixed spatiotemporal representation through a recursive aggregation module, that is, the above S404-A includes the following steps of S404-A1:
  • the encoding end uses the recursive aggregation module to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment to obtain a hybrid spatiotemporal representation.
• each time it generates a mixed spatio-temporal representation, the recursive aggregation module of the embodiment of the present application learns and retains deep-level feature information from the input feature information, and uses the learned deep-level features as implicit feature information for generating the next mixed spatio-temporal representation, thereby improving the accuracy of the generated mixed spatio-temporal representation. That is to say, in the embodiment of this application, the implicit feature information of the recursive aggregation module at the previous moment includes the feature information of multiple reconstructed images before the current image that the recursive aggregation module has learned.
• therefore, the encoding end uses the recursive aggregation module to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment, and can thus generate a more accurate, rich and detailed hybrid spatio-temporal representation.
  • the embodiments of this application do not limit the specific network structure of the recursive aggregation module, for example, it can be any network structure that can realize the above functions.
  • the recursive aggregation module is stacked by at least one spatio-temporal recursive network ST-LSTM.
  • the expression formula of the above hybrid spatio-temporal representation Gt is as shown in the above formula (1).
  • S404-B Perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image, where P is a positive integer.
• since the hybrid spatio-temporal representation in the embodiment of the present application fuses the feature information of the current image and of multiple reconstructed images before the current image, performing motion compensation on the previous reconstructed image according to the hybrid spatio-temporal representation can obtain P accurate predicted images of the current image.
  • the embodiment of the present application does not place a limit on the specific number of P predicted images generated. That is, in the embodiment of the present application, the encoding end can use different methods to perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation, and obtain P predicted images of the current image.
  • the embodiments of the present application do not limit the specific manner in which the encoding end performs motion compensation on the previous reconstructed image based on the mixed spatiotemporal representation.
  • the P predicted images include a first predicted image, which is obtained by the encoding end using optical flow motion compensation.
• the above S404-B includes the following steps S404-B1 and S404-B2:
• S404-B1 Determine the optical flow motion information according to the mixed spatiotemporal representation.
  • S404-B2 Perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
  • the embodiments of this application do not limit the specific way in which the encoding end determines the optical flow motion information based on the mixed spatiotemporal representation.
  • the encoding end obtains optical flow motion information through a pre-trained neural network model, that is, the neural network model can predict optical flow motion information based on mixed spatiotemporal representation.
  • the neural network model may be called a first decoder, or optical flow signal decoder Df.
  • the encoding end inputs the mixed spatio-temporal representation Gt into the optical flow signal decoder Df to predict the optical flow motion information, and obtains the optical flow motion information f x,y output by the optical flow signal decoder Df.
• f x, y is 2-channel optical flow motion information.
  • the optical flow signal decoder Df is composed of multiple NLAMs and multiple upsampling modules.
• the optical flow signal decoder Df includes 1 NLAM, 3 LAMs and 4 downsampling modules, where a downsampling module is connected after the NLAM and a downsampling module is connected after each LAM.
• the optical flow motion information f x, y is used to perform motion compensation on the previous reconstructed image to obtain the first predicted image X 1 .
  • the embodiments of this application do not limit the specific method by which the encoding end performs motion compensation on the previous reconstructed image based on the optical flow motion information to obtain the first predicted image.
• for example, the encoding end uses the optical flow motion information f x, y to perform linear interpolation on the previous reconstructed image, and the image generated by the interpolation is recorded as the first predicted image X 1 .
  • the encoding end obtains the first predicted image X 1 through the following formula (3).
• for example, the encoding end uses the optical flow motion information f x, y to perform motion compensation on the previous reconstructed image through a warping operation to obtain the first predicted image X 1 .
• the P predicted images include a second predicted image, which is obtained by the encoding end using offset motion compensation.
• in this case, the above S404-B includes the following steps S404-B-1 to S404-B-3:
• the encoding end obtains the offset corresponding to the current image through a pre-trained neural network model; that is, the neural network model can predict the offset based on the mixed spatiotemporal representation, and the offset is lossy offset information.
  • the neural network model may be called the second decoder, or variable convolutional decoder Dm. The encoding end inputs the mixed spatio-temporal representation Gt into the variable convolution decoder Dm to predict the offset information.
  • the encoding end performs spatial feature extraction on the previous reconstructed image to obtain reference feature information.
  • the encoding end uses the spatial feature extraction module SFE to extract spatial features from the previous reconstructed image to obtain reference feature information.
  • the encoding end uses the offset to perform motion compensation on the extracted reference feature information to obtain a second predicted image of the current image.
  • Embodiments of the present application do not limit the specific manner in which the encoding end uses the offset to perform motion compensation on the extracted reference feature information to obtain the second predicted image of the current image.
  • the encoding end uses the offset to perform motion compensation based on deformable convolution on the reference feature information to obtain the second predicted image.
• since the transformable convolution can generate the offset corresponding to the current image based on the mixed spatio-temporal representation, the encoding end inputs the mixed spatio-temporal representation Gt and the reference feature information into the transformable convolution.
  • the transformable convolution generates an offset corresponding to the current image based on the mixed spatiotemporal representation Gt, and the offset is applied to the reference feature information for motion compensation, thereby obtaining the second predicted image.
• the variable convolution decoder Dm in the embodiment of the present application includes a transformable convolution DCN, and the encoding end inputs the previous reconstructed image into the spatial feature extraction module SFE for spatial feature extraction to obtain the reference feature information.
  • the mixed spatio-temporal representation Gt and the reference feature information are input into the transformable convolution DCN for offset extraction and motion compensation to obtain the second predicted image X 2 .
  • the encoding end generates the second predicted image X 2 through the above formula (4).
• in addition to the transformable convolution DCN, the variable convolution decoder Dm also includes 1 NLAM, 3 LAMs and 4 downsampling modules, where a downsampling module is connected after the NLAM and a downsampling module is connected after each LAM.
• the encoding end first inputs the previous reconstructed image into the spatial feature extraction module SFE for spatial feature extraction to obtain the reference feature information.
• then, the mixed spatio-temporal representation Gt and the reference feature information are input into the transformable convolution DCN in the variable convolution decoder Dm for offset extraction and motion compensation to obtain feature information; after feature extraction by the NLAM, the 3 LAMs and the 4 downsampling modules, this feature information is finally restored to the second predicted image X 2 .
• the encoding end can determine P predicted images, for example, the first predicted image and the second predicted image, and then performs the following step S404-C.
  • S404-C Determine the reconstructed image of the current image based on the P predicted images.
  • the reconstructed image of the current image is determined based on the predicted image.
• for example, the predicted image is compared with one or several previous reconstructed images of the current image, and the loss is calculated. If the loss is small, it means that the prediction accuracy of the predicted image is high, and the predicted image can be determined as the reconstructed image of the current image.
• if the loss is large, it means that the prediction accuracy of the predicted image is low; in this case, the reconstructed image of the current image can be determined based on the previous one or several reconstructed images of the current image and the predicted image.
• for example, the predicted image and one or several previous reconstructed images of the current image are input into a neural network to obtain the reconstructed image of the current image.
  • the above S404-C includes the following steps of S404-C-A and S404-C-B:
  • S404-C-A Determine the target predicted image of the current image based on the P predicted images.
  • the encoding end first determines the target predicted image of the current image based on P predicted images, and then implements the reconstructed image of the current image based on the target predicted image of the current image, thereby improving the accuracy of determining the reconstructed image.
  • the embodiment of the present application does not limit the specific method of determining the target predicted image of the current image based on the P predicted images.
• for example, if P is equal to 1, the one predicted image is determined as the target predicted image of the current image.
• if P is greater than 1, in one possible implementation, S404-C-A includes S404-C-A11 and S404-C-A12:
• the P predicted images are weighted to generate a weighted image, and the target predicted image is then obtained according to the weighted image.
  • the embodiment of the present application does not limit the specific method of determining the weighted image based on the P predicted images.
  • the weights corresponding to P predicted images are determined; and the P predicted images are weighted according to the weights corresponding to the P predicted images to obtain weighted images.
• the encoding end determines the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image, and weights the first predicted image and the second predicted image based on the first weight and the second weight to obtain a weighted image.
  • the methods for determining the weights corresponding to the P predicted images include but are not limited to the following:
  • Method 2 The encoding end performs adaptive masking based on the mixed spatiotemporal representation to obtain weights corresponding to P predicted images.
  • the encoding end uses a neural network model to generate weights corresponding to P predicted images.
  • the neural network model is pre-trained and can be used to generate weights corresponding to P predicted images.
  • this neural network model is also called the third decoder or adaptive mask compensation decoder Dw .
  • the encoding end inputs the mixed spatio-temporal representation into the adaptive mask compensation decoder Dw to perform adaptive masking, and obtains the weights corresponding to the P predicted images.
• for example, the encoding end inputs the mixed spatio-temporal representation Gt into the adaptive mask compensation decoder D w for adaptive masking, and the adaptive mask compensation decoder D w outputs the first weight w1 of the first predicted image and the second weight w2 of the second predicted image.
• based on the first weight w1 and the second weight w2, the information representing different areas in the predicted frame can be adaptively selected from the first predicted image X 1 and the second predicted image X 2 , and a weighted image is then generated.
  • the weighted image X 3 is generated according to the above formula (5).
• the weights corresponding to the P predicted images are a matrix, including the weight corresponding to each pixel in the predicted image; when generating the weighted image, for each pixel in the current image, the predicted values and weights corresponding to that pixel in the P predicted images are weighted to obtain the weighted predicted value of the pixel, and the weighted predicted values corresponding to all pixels in the current image constitute the weighted image of the current image.
  • the embodiment of the present application does not limit the specific network structure of the above-mentioned adaptive mask compensation decoder D w .
• the adaptive mask compensation decoder Dw includes 1 NLAM, 3 LAMs, 4 downsampling modules and a sigmoid function, where a downsampling module is connected after the NLAM and a downsampling module is connected after each LAM.
• after the encoding end weights the P predicted images according to the above method and obtains the weighted image, the following S404-C-A12 is performed.
  • the weighted image is determined as the target prediction image.
  • the encoding end can also obtain the residual image of the current image based on the mixed spatiotemporal representation.
  • the encoding end uses a neural network model to obtain the residual image of the current image.
  • the neural network model is pre-trained and can be used to generate the residual image of the current image.
  • this neural network model is also called the fourth decoder or spatial texture enhancement decoder Dt.
• the spatial texture enhancement decoder Dt includes 1 NLAM, 3 LAMs, and 4 downsampling modules, where a downsampling module is connected after the NLAM and a downsampling module is connected after each LAM.
• determining the target predicted image of the current image based on the P predicted images in S404-C-A above includes the following steps of S404-C-A21:
• if P is equal to 1, a target predicted image is obtained based on the predicted image and the residual image. For example, the predicted image and the residual image are added to generate the target predicted image.
• if P is greater than 1, the weighted image is first determined based on the P predicted images, and the target predicted image is then determined based on the weighted image and the residual image.
  • the specific process of determining the weighted image by the encoding end based on the P predicted images can refer to the specific description of S204-A11 above, which will not be described again here.
• for example, the first weight w1 corresponding to the first predicted image and the second weight w2 corresponding to the second predicted image are determined.
• according to the first weight w1 and the second weight w2, the first predicted image and the second predicted image are weighted to obtain a weighted image X 3 , and then the residual image X r is used to enhance the weighted image X 3 to obtain the target predicted image.
  • the target predicted image X 4 is generated according to the above formula (6).
  • S404-C-B Determine the reconstructed image of the current image based on the target prediction image.
• the target predicted image is compared with one or several previous reconstructed images of the current image, and the loss is calculated. If the loss is small, it means that the prediction accuracy of the target predicted image is high, and the target predicted image can be determined as the reconstructed image of the current image. If the loss is large, it means that the prediction accuracy of the target predicted image is low.
• in that case, the reconstructed image of the current image can be determined based on the previous one or several reconstructed images of the current image and the target predicted image. For example, the target predicted image and one or several previous reconstructed images of the current image are input into a neural network to obtain the reconstructed image of the current image.
  • the encoding end determines the residual value of the current image based on the current image and the target predicted image; the residual value is encoded to obtain a residual code stream.
  • the embodiment of the present application also includes residual decoding.
  • the above S404-C-B includes the following steps of S404-C-B1 and S404-C-B2:
• in order to improve the effect of the reconstructed image, the encoding end also generates a residual code stream through residual coding. Specifically, the encoding end determines the residual value of the current image and encodes the residual value to generate the residual code stream. Correspondingly, the encoding end decodes the residual code stream to obtain the residual value of the current image, and obtains the reconstructed image based on the target predicted image and the residual value.
  • the embodiment of the present application does not limit the specific expression form of the residual value of the above-mentioned current image.
  • the residual value of the current image is a matrix, and each element in the matrix is the residual value corresponding to each pixel in the current image.
  • the encoding end can add the residual value and prediction value corresponding to each pixel in the target prediction image pixel by pixel to obtain the reconstruction value of each pixel, and then obtain the reconstructed image of the current image.
• for the i-th pixel, the predicted value corresponding to the i-th pixel is obtained from the target predicted image, and the residual value corresponding to the i-th pixel is obtained from the residual value of the current image; the two are added to obtain the reconstruction value of the i-th pixel.
• in this way, the reconstruction value corresponding to each pixel in the current image can be obtained, and the reconstruction values corresponding to all pixels in the current image form the reconstructed image of the current image.
  • the embodiments of this application do not limit the specific method by which the encoding end obtains the residual value of the current image. That is to say, the embodiments of this application do not limit the residual encoding and decoding methods used by both encoding and decoding ends.
• the encoding end determines the target predicted image of the current image, and then obtains the residual value of the current image based on the current image and the target predicted image. For example, the difference between the current image and the target predicted image is determined as the residual value of the current image. Next, the residual value of the current image is encoded to generate the residual code stream.
  • the residual value of the current image can be transformed to obtain the transformation coefficient, the transformation coefficient can be quantized to obtain the quantized coefficient, and the quantized coefficient can be encoded to obtain the residual code stream.
• the encoding end decodes the residual code stream to obtain the residual value of the current image; for example, it decodes the residual code stream to obtain the quantization coefficient, and performs inverse quantization and inverse transformation on the quantization coefficient to obtain the residual value of the current image. Then, according to the above method, the target predicted image and the residual value of the current image are added to obtain the reconstructed image of the current image.
  • the encoding end may use a neural network method to process the current image and the target predicted image of the current image, generate a residual value of the current image, encode the residual value of the current image, and generate a residual code stream.
  • the encoding end can obtain the reconstructed image of the current image according to the above method.
  • the reconstructed image can be displayed directly.
  • the reconstructed image can also be stored in a cache for subsequent image encoding.
• the encoding end obtains the first feature information by performing feature fusion on the current image and the previous reconstructed image of the current image; the first feature information is quantized to obtain the quantized first feature information; and the quantized first feature information is encoded to obtain the first code stream. In this way, the decoder decodes the first code stream to determine the quantized first feature information, performs multi-level time domain fusion on the quantized first feature information to obtain a mixed spatio-temporal representation, performs motion compensation on the previous reconstructed image according to the mixed spatio-temporal representation to obtain P predicted images of the current image, and then determines the reconstructed image of the current image based on the P predicted images.
• the quantized first feature information is feature-fused with multiple reconstructed images before the current image, so that when certain information in the previous reconstructed image of the current image is occluded, the occluded information can still be obtained from several reconstructed images before the current image, and the generated hybrid spatio-temporal representation therefore includes more accurate, rich and detailed feature information.
• based on this hybrid spatio-temporal representation, P high-precision predicted images can be generated, and the reconstructed image of the current image can be accurately obtained from them, thereby improving the video compression effect.
  • an end-to-end neural network-based encoding and decoding framework is proposed.
  • the neural network-based encoding and decoding framework includes a neural network-based encoder and a neural network-based decoder.
  • the encoding process of the embodiment of the present application will be introduced below in conjunction with a possible encoder based on neural networks of the present application.
  • Figure 12 is a schematic network structure diagram of a neural network-based encoder according to an embodiment of the present application, including: a spatiotemporal feature extraction module, an inverse transformation module, a recursive aggregation module and a hybrid motion compensation module.
  • the spatiotemporal feature extraction module is used to extract and downsample features of the cascaded current image and the previous reconstructed image to obtain the first feature information.
  • the inverse transformation module is used to perform inverse transformation on the quantized second feature information to obtain reconstructed feature information of the first feature information.
  • its network structure is shown in Figure 3.
  • the recursive aggregation module is used to perform multi-level time domain fusion on the quantized first feature information to obtain a hybrid spatio-temporal representation.
  • its network structure is shown in Figure 4.
  • the hybrid motion compensation module is used to perform hybrid motion compensation on the mixed spatio-temporal representation to obtain the target predicted image of the current image.
• the hybrid motion compensation module may include the first decoder shown in Figure 5 and/or the second decoder shown in Figure 6. Optionally, if the hybrid motion compensation module includes both the first decoder and the second decoder, it may also include the third decoder shown in Figure 7. In some embodiments, the hybrid motion compensation module may further include the fourth decoder shown in Figure 8.
  • the embodiment of the present application takes the motion compensation module including a first decoder, a second decoder, a third decoder, and a fourth decoder as an example for description.
  • Figure 13 is a schematic diagram of the video encoding process provided by an embodiment of the present application. As shown in Figure 13, it includes:
• the encoding end cascades the current image X t and the previous reconstructed image of the current image along the channel dimension to obtain X cat , and then performs feature extraction on the cascaded image X cat to obtain the first feature information.
  • the specific network structure of the inverse transformation module is shown in Figure 3, including 2 non-local self-attention modules and 2 upsampling modules.
• the encoding end inputs the quantized second feature information into the inverse transformation module for inverse transformation, and the inverse transformation module outputs the reconstructed feature information.
  • S509 Encode the quantized first feature information according to the probability distribution of the quantized first feature information to obtain the first code stream.
  • Embodiments of the present application also include a process of determining the reconstructed image.
  • S510 Decode the first code stream according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.
  • the recursive aggregation module is stacked by at least one spatiotemporal recursive network.
• the encoding end inputs the quantized first feature information into the recursive aggregation module, so that the recursive aggregation module fuses the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment, and then outputs a mixed spatiotemporal representation.
  • S512 Process the mixed spatiotemporal representation through the first decoder to obtain the first predicted image.
• the mixed spatiotemporal representation and the previous reconstructed image are input into the hybrid motion compensation module for hybrid motion compensation to obtain the target predicted image of the current image.
  • the mixed spatio-temporal representation is processed by the first decoder to determine the optical flow motion information, and motion compensation is performed on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
  • the network structure of the first decoder is shown in Figure 5.
• SFE is used to extract spatial features from the previous reconstructed image to obtain the reference feature information; the reference feature information and the mixed spatio-temporal representation are input into the second decoder, so that offset-based motion compensation is performed on the reference feature information to obtain the second predicted image.
  • the network structure of the second decoder is shown in Figure 6.
  • the mixed spatio-temporal representation is input to the third decoder for adaptive masking to obtain the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image.
  • the network structure of the third decoder is shown in Figure 7.
  • the product of the first weight and the first predicted image is added to the product of the second weight and the second predicted image to obtain a weighted image.
  • S516 Process the mixed spatiotemporal representation through the fourth decoder to obtain the residual image of the current image.
  • the mixed spatiotemporal representation is input to the fourth decoder for processing to obtain the residual image of the current image.
  • the weighted image and the residual image are added together to determine the target prediction image.
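  • Putting the masking, weighting and residual steps together, a minimal sketch in which the mask and texture decoders are reduced to single convolutions (an assumption about their interface only, not their real multi-layer structure), with the mixed spatio-temporal representation assumed to be at image resolution:

        import torch
        import torch.nn as nn

        class BlendAndRefine(nn.Module):
            def __init__(self, ctx_ch=192):
                super().__init__()
                self.mask_dec = nn.Conv2d(ctx_ch, 2, 3, padding=1)      # -> w1, w2
                self.residual_dec = nn.Conv2d(ctx_ch, 3, 3, padding=1)  # -> residual image

            def forward(self, g_t, x1, x2):
                w = torch.sigmoid(self.mask_dec(g_t))   # adaptive mask in [0, 1]
                w1, w2 = w[:, 0:1], w[:, 1:2]           # per-pixel weights
                x3 = w1 * x1 + w2 * x2                  # weighted image
                return x3 + self.residual_dec(g_t)      # target predicted image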
  • multi-level time domain fusion is performed on the quantized first feature information, that is, the quantized first feature information is feature-fused with the multiple reconstructed images preceding the current image, so that the generated mixed spatiotemporal representation includes more accurate, rich and detailed feature information.
  • motion compensation is then performed on the previous reconstructed image based on the mixed spatiotemporal representation to generate multiple pieces of decoding information, including the first predicted image, the second predicted image, the weights respectively corresponding to the first predicted image and the second predicted image, and the residual image. When the target prediction image of the current image is determined based on these pieces of decoding information, the accuracy of the target prediction image can be effectively improved, the reconstructed image of the current image can then be accurately obtained based on the accurate prediction image, and the video compression effect is thereby improved.
  • FIG. 2 to FIG. 13 are only examples of the present application and should not be understood as limitations of the present application.
  • the size of the sequence numbers of the above-mentioned processes does not imply their order of execution; the execution order of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
  • the term "and/or" is only an association relationship describing associated objects, indicating that three relationships can exist. Specifically, A and/or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone.
  • the character "/" in this article generally indicates that the related objects are an "or" relationship.
  • Figure 14 is a schematic block diagram of a video decoding device provided by an embodiment of the present application.
  • the video decoding device 10 includes:
  • the decoding unit 11 is used to decode the first code stream and determine the quantized first feature information, which is obtained by feature fusion of the current image and the previous reconstructed image of the current image;
  • the fusion unit 12 is used to perform multi-level time domain fusion on the quantized first feature information to obtain a mixed spatiotemporal representation
  • the compensation unit 13 is configured to perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image, where P is a positive integer;
  • the reconstruction unit 14 is configured to determine the reconstructed image of the current image according to the P predicted images.
  • the fusion unit 12 is specifically configured to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment through a recursive aggregation module to obtain the mixed space-time representation.
  • the recursive aggregation module is formed by stacking at least one spatiotemporal recursive network.
  • the P predicted images include the first predicted image
  • the compensation unit 13 is specifically configured to determine the optical flow motion information according to the mixed spatiotemporal representation, and to perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
  • the P predicted images include a second predicted image
  • the compensation unit 13 is specifically configured to obtain the offset corresponding to the current image according to the mixed spatiotemporal representation; perform spatial feature extraction on the previous reconstructed image to obtain reference feature information; and use the offset to perform motion compensation on the reference feature information to obtain the second predicted image.
  • the compensation unit 13 is specifically configured to use the offset to perform motion compensation based on deformable convolution on the reference feature information to obtain the second predicted image.
  • the reconstruction unit 14 is configured to determine a target predicted image of the current image based on the P predicted images; and determine a reconstructed image of the current image based on the target predicted image.
  • the reconstruction unit 14 is configured to determine a weighted image based on the P predicted images; and obtain the target predicted image based on the weighted image.
  • the reconstruction unit 14 is further configured to obtain the residual image of the current image based on the mixed spatio-temporal representation; and obtain the target predicted image based on the P predicted images and the residual image.
  • the reconstruction unit 14 is specifically configured to determine a weighted image according to the P prediction images; and determine the target prediction image according to the weighted image and the residual image.
  • the reconstruction unit 14 is specifically configured to determine the weights corresponding to the P predicted images; weight the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image .
  • the reconstruction unit 14 is specifically configured to perform adaptive masking according to the mixed spatiotemporal representation to obtain weights corresponding to the P predicted images.
  • the reconstruction unit 14 is specifically configured to determine the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image, and to weight the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.
  • the reconstruction unit 14 is specifically configured to decode the residual code stream to obtain the residual value of the current image; and obtain the reconstructed image according to the target prediction image and the residual value.
  • the decoding unit 11 is specifically used to decode the second code stream to obtain quantized second feature information.
  • the second feature information is obtained by performing feature transformation on the first feature information; the probability distribution of the quantized first feature information is determined according to the quantized second feature information, and the first code stream is decoded according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.
  • the decoding unit 11 is specifically configured to perform inverse transformation on the quantized second feature information to obtain reconstructed feature information; determine the probability distribution of the reconstructed feature information; and according to the probability of the reconstructed feature information Distribution, predict the probability distribution of the quantized first feature information.
  • the decoding unit 11 is specifically configured to perform N times of non-local attention transformation and N times of upsampling on the quantized second feature information to obtain the reconstructed feature information, where N is a positive integer.
  • the decoding unit 11 is specifically configured to predict the probability of each coded pixel in the quantized first feature information according to the probability distribution of the reconstructed feature information, and to obtain the probability distribution of the quantized first feature information according to the probability of each coded pixel in the quantized first feature information.
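  • One common way to realise this, sketched below under the assumption that the reconstructed feature information carries a per-element mean and scale of a Gaussian (the split into two halves is an illustrative assumption), is to take the probability of each quantized symbol as the Gaussian mass over its unit-width bin:

        import torch
        import torch.nn.functional as F
        from torch.distributions import Normal

        def symbol_probabilities(recon_feat, y_hat, eps=1e-9):
            """recon_feat: (N, 2*C, H, W) assumed to hold per-element mean and scale;
            y_hat: (N, C, H, W) quantized first feature (integer-valued floats).
            Returns P(y_hat) per element, which drives the entropy (de)coder."""
            mean, scale = recon_feat.chunk(2, dim=1)
            dist = Normal(mean, F.softplus(scale) + eps)  # keep the std positive
            prob = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
            return prob.clamp_min(eps)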
  • the device embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
  • the video decoding device 10 shown in FIG. 14 may correspond to the corresponding subject performing the method of the embodiments of the present application, and the aforementioned and other operations and/or functions of each unit in the video decoding device 10 are respectively intended to implement the corresponding processes in the method embodiments, which are not repeated here for brevity.
  • Figure 15 is a schematic block diagram of a video encoding device provided by an embodiment of the present application.
  • the video encoding device 20 includes:
  • the fusion unit 21 is used to perform feature fusion on the current image and the previous reconstructed image of the current image to obtain the first feature information
  • the quantization unit 22 is used to quantize the first feature information to obtain the quantized first feature information
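  • A sketch of this quantization step: hard rounding when actually encoding, and additive uniform noise in [-0.5, 0.5] as a differentiable stand-in during training (a training detail described later in this document; the exact interface shown here is an assumption):

        import torch

        def quantize(feat, training=False):
            """Round the first feature information for entropy coding; during
            training, approximate rounding with uniform noise so gradients flow."""
            if training:
                return feat + torch.empty_like(feat).uniform_(-0.5, 0.5)
            return torch.round(feat)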
  • the encoding unit 23 is configured to encode the quantized first feature information to obtain the first code stream.
  • the fusion unit 21 is specifically configured to channel-concatenate the current image and the reconstructed image to obtain a cascaded image, and perform feature extraction on the cascaded image to obtain the first feature information.
  • the fusion unit 21 is specifically configured to perform Q times of non-local attention transformation and Q times of downsampling on the concatenated image to obtain the first feature information, where the Q is a positive integer.
  • the encoding unit 23 is also used to perform feature transformation according to the first feature information to obtain the second feature information; quantize the second feature information and then encode it to obtain the second code stream; decode the second code stream to obtain the quantized second feature information, and determine the probability distribution of the quantized first feature information based on the quantized second feature information; and encode the quantized first feature information based on the probability distribution of the quantized first feature information to obtain the first code stream.
  • the encoding unit 23 is specifically configured to perform N times of non-local attention transformation and N times of downsampling on the first feature information to obtain the second feature information, where N is a positive integer.
  • the encoding unit 23 is specifically configured to perform N times of non-local attention transformation and N times of downsampling on the quantized first feature information to obtain the second feature information.
  • the encoding unit 23 is also used to quantize the second feature information to obtain the quantized second feature information; determine the probability distribution of the quantized second feature information; and encode the quantized second feature information according to the probability distribution of the quantized second feature information to obtain the second code stream.
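  • As an illustration of how such a probability distribution drives the encoder, the sketch below only estimates the resulting bit cost; feeding the same probabilities to an actual arithmetic or range coder to emit the code stream is omitted, since the text does not name a specific coder:

        import torch

        def estimated_bits(prob):
            """prob: per-symbol probabilities (e.g. from a Gaussian entropy model).
            Returns the theoretical code length in bits for the quantized symbols."""
            return torch.sum(-torch.log2(prob))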
  • the encoding unit 23 is specifically configured to perform inverse transformation on the quantized second feature information to obtain reconstructed feature information; determine the probability distribution of the reconstructed feature information; and according to the probability of the reconstructed feature information Distribution determines the probability distribution of the quantized first feature information.
  • the encoding unit 23 is specifically configured to perform N times of non-local attention transformation and N times of upsampling on the quantized second feature information to obtain the reconstructed feature information, where N is a positive integer.
  • the encoding unit 23 is specifically configured to determine the probability of each coded pixel in the quantized first feature information according to the probability distribution of the reconstructed feature information, and to obtain the probability distribution of the quantized first feature information according to the probability of each coded pixel in the quantized first feature information.
  • the encoding unit 23 is also used to determine the reconstructed image of the current image.
  • the encoding unit 23 is specifically configured to perform multi-level time domain fusion on the quantized first feature information to obtain a mixed spatiotemporal representation; and perform motion on the previous reconstructed image according to the mixed spatiotemporal representation. Compensation is performed to obtain P predicted images of the current image, where P is a positive integer; based on the P predicted images, a reconstructed image of the current image is determined.
  • the encoding unit 23 is specifically configured to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment through a recursive aggregation module to obtain the mixed space-time representation.
  • the recursive aggregation module is formed by stacking at least one spatiotemporal recursive network.
  • the P predicted images include the first predicted image
  • the encoding unit 23 is specifically configured to determine the optical flow motion information based on the mixed spatiotemporal representation, and to perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
  • the P predicted images include a second predicted image
  • the encoding unit 23 is specifically configured to obtain the offset corresponding to the current image according to the mixed spatiotemporal representation; perform spatial feature extraction on the previous reconstructed image to obtain reference feature information; and use the offset to perform motion compensation on the reference feature information to obtain the second predicted image.
  • the encoding unit 23 is specifically configured to use the offset to perform motion compensation based on deformable convolution on the reference feature information to obtain the second predicted image.
  • the encoding unit 23 is specifically configured to determine the target predicted image of the current image based on the P predicted images; and determine the reconstructed image of the current image based on the target predicted image.
  • the encoding unit 23 is specifically configured to determine a weighted image based on the P predicted images; and obtain the target predicted image based on the weighted image.
  • the encoding unit 23 is further configured to obtain the residual image of the current image based on the mixed spatio-temporal representation; and obtain the target predicted image based on the P predicted images and the residual image.
  • the encoding unit 23 is specifically configured to determine a weighted image according to the P prediction images; and determine the target prediction image according to the weighted image and the residual image.
  • the encoding unit 23 is specifically configured to determine the weights corresponding to the P predicted images; weight the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image .
  • the encoding unit 23 is specifically configured to perform adaptive masking according to the mixed spatiotemporal representation to obtain weights corresponding to the P predicted images.
  • the encoding unit 23 is specifically configured to determine the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image, and to weight the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.
  • the encoding unit 23 is also configured to determine the residual value of the current image according to the current image and the target predicted image; encode the residual value to obtain a residual code stream.
  • the encoding unit 23 is specifically configured to decode the residual code stream to obtain the residual value of the current image, and to obtain the reconstructed image according to the target predicted image and the residual value.
  • the device embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
  • the video encoding device 20 shown in FIG. 15 may correspond to the corresponding subject performing the method of the embodiments of the present application, and the aforementioned and other operations and/or functions of each unit in the video encoding device 20 are respectively intended to implement the corresponding processes in the method embodiments, which are not repeated here for brevity.
  • the software unit may be located in a mature storage medium in this field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, register, etc.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above method embodiment in combination with its hardware.
  • Figure 16 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 30 may be the video encoder or video decoder described in the embodiment of the present application.
  • the electronic device 30 may include:
  • a memory 33 and a processor 32, where the memory 33 is used to store the computer program 34 and to transmit the program code 34 to the processor 32.
  • the processor 32 can call and run the computer program 34 from the memory 33 to implement the method in the embodiment of the present application.
  • the processor 32 may be configured to perform steps in the above method according to instructions in the computer program 34 .
  • the processor 32 may include but is not limited to:
  • Digital Signal Processor (DSP)
  • Application-Specific Integrated Circuit (ASIC)
  • Field-Programmable Gate Array (FPGA)
  • the memory 33 includes but is not limited to:
  • Non-volatile memory can be Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM) or flash memory. Volatile memory may be Random Access Memory (RAM), which is used as an external cache.
  • Many forms of RAM are available, for example: Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchronous Link Dynamic Random Access Memory (SLDRAM) and Direct Rambus Random Access Memory (Direct Rambus RAM).
  • the computer program 34 can be divided into one or more units, and the one or more units are stored in the memory 33 and executed by the processor 32 to complete the tasks provided by this application.
  • the one or more units may be a series of computer program instruction segments capable of completing specific functions. The instruction segments are used to describe the execution process of the computer program 34 in the electronic device 30 .
  • the electronic device 30 may also include:
  • a transceiver 33, where the transceiver 33 can be connected to the processor 32 or the memory 33.
  • the processor 32 can control the transceiver 33 to communicate with other devices. Specifically, it can send information or data to other devices, or receive information or data sent by other devices.
  • Transceiver 33 may include a transmitter and a receiver.
  • the transceiver 33 may further include an antenna, and the number of antennas may be one or more.
  • a bus system, where in addition to the data bus, the bus system also includes a power bus, a control bus and a status signal bus.
  • Figure 17 is a schematic block diagram of the video encoding and decoding system 40 provided by the embodiment of the present application.
  • the video encoding and decoding system 40 may include a video encoder 41 and a video decoder 42, where the video encoder 41 is used to perform the video encoding method involved in the embodiments of the present application, and the video decoder 42 is used to perform the video decoding method involved in the embodiments of the present application.
  • this application also provides a code stream, which is obtained by the above encoding method.
  • This application also provides a computer storage medium on which a computer program is stored.
  • when the computer program is executed by a computer, the computer can perform the method of the above method embodiments.
  • embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer causes the computer to perform the method of the above method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center over a wired connection (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless connection (such as infrared, radio or microwave).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks, magnetic tapes), optical media (such as digital video discs (DVD)), or semiconductor media (such as solid state disks (SSD)), etc.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separate.
  • a component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or it may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in various embodiments of the present application can be integrated into a processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present application provide a video encoding and decoding method, apparatus, device, system and storage medium. In order to improve the accuracy of the reconstructed image, multi-level temporal fusion is performed on the quantized first feature information, that is, the quantized first feature information is fused not only with the feature information of the previous reconstructed image of the current image, but also with multiple reconstructed images preceding the current image. In this way, when some information in the previous reconstructed image of the current image is occluded, the occluded information can be obtained from several reconstructed images preceding the current image, so that the generated mixed spatio-temporal representation includes more accurate, rich and detailed feature information. When motion compensation is performed on the previous reconstructed image based on this mixed spatio-temporal representation, P high-precision predicted images can be generated, and the reconstructed image of the current image can be accurately obtained based on these P high-precision predicted images, thereby improving the video compression effect.

Description

视频编解码方法、装置、设备、系统及存储介质 技术领域
本申请涉及视频编解码技术领域,尤其涉及一种视频编解码方法、装置、设备、系统及存储介质。
背景技术
数字视频技术可以并入多种视频装置中,例如数字电视、智能手机、计算机、电子阅读器或视频播放器等。随着视频技术的发展,视频数据所包括的数据量较大,为了便于视频数据的传输,视频装置执行视频压缩技术,以使视频数据更加有效的传输或存储。
随着神经网络技术的快速发展,神经网络技术在视频压缩技术中得到广泛应用,例如,在环路滤波、编码块划分和编码块预测等中得到应用。但是,目前的基于神经网络的视频压缩技术,压缩效果不佳。
发明内容
本申请实施例提供了一种视频编解码方法、装置、设备、系统及存储介质,以提高视频压缩效果。
第一方面,本申请提供了一种视频解码方法,包括:
解码第一码流,确定量化后的第一特征信息,所述第一特征信息是对当前图像和所述当前图像的前一重建图像进行特征融合得到的;
对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;
根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;
根据所述P个预测图像,确定所述当前图像的重建图像。
第二方面,本申请实施例提供一种视频编码方法,包括:
对当前图像以及所述当前图像的前一重建图像进行特征融合,得到第一特征信息;
对所述第一特征信息进行量化,得到量化后的所述第一特征信息;
对量化后的所述第一特征信息进行编码,得到所述第一码流。
第三方面,本申请提供了一种视频编码器,用于执行上述第一方面或其各实现方式中的方法。具体地,该编码器包括用于执行上述第一方面或其各实现方式中的方法的功能单元。
第四方面,本申请提供了一种视频解码器,用于执行上述第二方面或其各实现方式中的方法。具体地,该解码器包括用于执行上述第二方面或其各实现方式中的方法的功能单元。
第五方面,提供了一种视频编码器,包括处理器和存储器。该存储器用于存储计算机程序,该处理器用于调用并运行该存储器中存储的计算机程序,以执行上述第一方面或其各实现方式中的方法。
第六方面,提供了一种视频解码器,包括处理器和存储器。该存储器用于存储计算机程序,该处理器用于调用并运行该存储器中存储的计算机程序,以执行上述第二方面或其各实现方式中的方法。
第七方面,提供了一种视频编解码系统,包括视频编码器和视频解码器。视频编码器用于执行上述第一方面或其各实现方式中的方法,视频解码器用于执行上述第二方面或其各实现方式中的方法。
第八方面,提供了一种芯片,用于实现上述第一方面至第二方面中的任一方面或其各实现方式中的方法。具体地,该芯片包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有该芯片的设备执行如上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第九方面,提供了一种计算机可读存储介质,用于存储计算机程序,该计算机程序使得计算机执行上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第十方面,提供了一种计算机程序产品,包括计算机程序指令,该计算机程序指令使得计算机执行上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第十一方面,提供了一种计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第十二方面,提供了一种码流,包括第二方面任一方面生成的码流。
基于以上技术方案,本申请为了提高重建图像的准确性,通过对量化后的第一特征信息进行多级时域融合,即将量化后的第一特征信息不仅与当前图像的前一重建图像的特征信息进行融合,并且将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,这样可以避免当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征对前一重建图像进行运动补偿时,可以生成高精度的P个预测图像时,基于该高精度的P个预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
附图说明
图1为本申请实施例涉及的一种视频编解码系统的示意性框图;
图2为本申请实施例提供的一种视频解码方法的流程示意图;
图3是本申请实施例涉及的反变换模块的网络结构示意图;
图4是本申请实施例涉及的递归聚合模块的网络结构示意图;
图5是本申请实施例涉及的第一解码器的网络结构示意图;
图6是本申请实施例涉及的第二解码器的网络结构示意图;
图7是本申请实施例涉及的第三解码器的网络结构示意图;
图8是本申请实施例涉及的第四解码器的网络结构示意图
图9为本申请一实施例涉及的一种基于神经网络的解码器的网络结构示意图;
图10为本申请一实施例提供的视频解码流程示意图;
图11为本申请实施例提供的视频编码方法的一种流程示意图;
图12为本申请一实施例涉及的一种基于神经网络的编码器的网络结构示意图;
图13为本申请一实施例提供的视频编码流程示意图;
图14是本申请实施例提供的视频解码装置的示意性框图;
图15是本申请实施例提供的视频编码装置的示意性框图;
图16是本申请实施例提供的电子设备的示意性框图;
图17是本申请实施例提供的视频编码系统的示意性框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。
本申请可应用于图像编解码领域、视频编解码领域、硬件视频编解码领域、专用电路视频编解码领域、实时视频编解码领域等。或者,本申请的方案可结合至其它专属或行业标准而操作,所述标准包含ITU-TH.261、ISO/IECMPEG-1Visual、ITU-TH.262或ISO/IECMPEG-2Visual、ITU-TH.263、ISO/IECMPEG-4Visual,ITU-TH.264(还称为ISO/IECMPEG-4AVC),包含可分级视频编解码(SVC)及多视图视频编解码(MVC)扩展。应理解,本申请的技术不限于任何特定编解码标准或技术。
为了便于理解,首先结合图1对本申请实施例涉及的视频编解码系统进行介绍。
图1为本申请实施例涉及的一种视频编解码系统的示意性框图。需要说明的是,图1只是一种示例,本申请实施例的视频编解码系统包括但不限于图1所示。如图1所示,该视频编解码系统100包含编码设备110和解码设备120。其中编码设备用于对视频数据进行编码(可以理解成压缩)产生码流,并将码流传输给解码设备。解码设备对编码设备编码产生的码流进行解码,得到解码后的视频数据。
本申请实施例的编码设备110可以理解为具有视频编码功能的设备,解码设备120可以理解为具有视频解码功能的设备,即本申请实施例对编码设备110和解码设备120包括更广泛的装置,例如包含智能手机、台式计算机、移动计算装置、笔记本(例如,膝上型)计算机、平板计算机、机顶盒、电视、相机、显示装置、数字媒体播放器、视频游戏控制台、车载计算机等。
在一些实施例中,编码设备110可以经由信道130将编码后的视频数据(如码流)传输给解码设备120。信道130可以包括能够将编码后的视频数据从编码设备110传输到解码设备120的一个或多个媒体和/或装置。
在一个实例中,信道130包括使编码设备110能够实时地将编码后的视频数据直接发射到解码设备120的一个或多个通信媒体。在此实例中,编码设备110可根据通信标准来调制编码后的视频数据,且将调制后的视频数据发射到解码设备120。其中通信媒体包含无线通信媒体,例如射频频谱,可选的,通信媒体还可以包含有线通信媒体,例如一根或多根物理传输线。
在另一实例中,信道130包括存储介质,该存储介质可以存储编码设备110编码后的视频数据。存储介质包含多种本地存取式数据存储介质,例如光盘、DVD、快闪存储器等。在该实例中,解码设备120可从该存储介质中获取编码后的视频数据。
在另一实例中,信道130可包含存储服务器,该存储服务器可以存储编码设备110编码后的视频数据。在此实例中,解码设备120可以从该存储服务器中下载存储的编码后的视频数据。可选的,该存储服务器可以存储编码后的视频数据且可以将该编码后的视频数据发射到解码设备120,例如web服务器(例如,用于网站)、文件传送协议(FTP)服务器等。
一些实施例中,编码设备110包含视频编码器112及输出接口113。其中,输出接口113可以包含调制器/解调器(调制解调器)和/或发射器。
在一些实施例中,编码设备110除了包括视频编码器112和输入接口113外,还可以包括视频源111。
视频源111可包含视频采集装置(例如,视频相机)、视频存档、视频输入接口、计算机图形系统中的至少一个,其中,视频输入接口用于从视频内容提供者处接收视频数据,计算机图形系统用于产生视频数据。
视频编码器112对来自视频源111的视频数据进行编码,产生码流。视频数据可包括一个或多个图像(picture)或图像序列(sequence of pictures)。码流以比特流的形式包含了图像或图像序列的编码信息。编码信息可以包含编码图像数据及相关联数据。相关联数据可包含序列参数集(sequence parameter set,简称SPS)、图像参数集(picture parameter set,简称PPS)及其它语法结构。SPS可含有应用于一个或多个序列的参数。PPS可含有应用于一个或多个图像的参数。语法结构是指码流中以指定次序排列的零个或多个语法元素的集合。
视频编码器112经由输出接口113将编码后的视频数据直接传输到解码设备120。编码后的视频数据还可存储于存储介质或存储服务器上,以供解码设备120后续读取。
在一些实施例中,解码设备120包含输入接口121和视频解码器122。
在一些实施例中,解码设备120除包括输入接口121和视频解码器122外,还可以包括显示装置123。
其中,输入接口121包含接收器及/或调制解调器。输入接口121可通过信道130接收编码后的视频数据。
视频解码器122用于对编码后的视频数据进行解码,得到解码后的视频数据,并将解码后的视频数据传输至显示装置123。
显示装置123显示解码后的视频数据。显示装置123可与解码设备120整合或在解码设备120外部。显示装置123可包括多种显示装置,例如液晶显示器(LCD)、等离子体显示器、有机发光二极管(OLED)显示器或其它类型的显示装置。
此外,图1仅为实例,本申请实施例的技术方案不限于图1,例如本申请的技术还可以应用于单侧的视频编码或单侧的视频解码。
在一些实施例中,上述视频编码器112可应用于亮度色度(YCbCr,YUV)格式的图像数据上。例如,YUV比例可以为4:2:0、4:2:2或者4:4:4,Y表示明亮度(Luma),Cb(U)表示蓝色色度,Cr(V)表示红色色度,U和V表示为色度(Chroma)用于描述色彩及饱和度。例如,在颜色格式上,4:2:0表示每4个像素有4个亮度分量,2个色度分量(YYYYCbCr),4:2:2表示每4个像素有4个亮度分量,4个色度分量(YYYYCbCrCbCr),4:4:4表示全像素显示(YYYYCbCrCbCrCbCrCbCr)。
由于视频的一个帧中的相邻像素之间存在很强的相关性,在视频编解码技术中使用帧内预测的方法消除相邻像素之间的空间冗余。由于视频中的相邻帧之间存在着很强的相似性,在视频编解码技术中使用帧间预测方法消除相邻帧之间的时间冗余,从而提高编码效率。
本申请实施例可用于帧间编码,用于提高帧间编码的效率。
视频编码技术主要针对序列化视频数据进行编码,主要用于互联网时代的数据存储、传输和呈现等应用。视频在现阶段占据了85%以上的流量空间与入口,随着未来用户对视频数据分辨率、帧率以及维度等需求的增加,未来视频编码技术所承载的作用与价值也将大幅提升,对于视频编码的技术提升与需求是巨大的机遇和挑战。传统视频编码技术经历了几十年的发展与变革,在每一个时代都极大地满足和服务于世界的视频服务。传统视频编码技术在基于多尺度块级的混合编码框架下迭代更新并沿用至今,伴随着硬件技术的飞速发展,视频编码通过子技术的提升,在牺牲一定复杂度的情况下,带来了极大的编码性能提升。然而,置换复杂度获取性能的方式由于硬件发展的瓶颈逐渐有了较为明显的限制,对硬件设计和更新提出了更高的要求,使得现在商用的传统编解码器通常需要进行一定的简化使用。
同时,深度学习技术尤其是深度神经网络技术的日趋成熟,在视频图像的多个任务上都有着广泛的研究和使用,包括视频增强、视频检测以及视频分割等。而应用于视频编码领域的深度学习技术最初主要集中于传统视频编码中子技术的研究与替换,通过研究传统视频编码中的相关模块,以原有的视频编码框架作为数据生成工具得到成对的训练数据对相应的神经网络进行训练,并在最终神经网络收敛后用于替换对应的模块,其中可替换的模块如环路内滤波、环路外滤波、编码块划分、编码块预测等。但是,目前的基于神经网络的视频压缩技术,压缩效果不佳。
为了进一步提高视频的压缩效果,本申请提出一种纯数据驱动的神经网络编码框架,即整个编解码系统都基于深度神经网络设计、训练并最终用于视频编码,并采用了一种全新的混合有损运动表达方式实现了基于神经网络的帧间编解码技术。
下面结合具体的实施例对本申请实施例提供的技术方案进行详细描述。
首先结合图2,以解码端为例进行介绍。
图2为本申请实施例提供的一种视频解码方法的流程示意图,本申请实施例应用于图1所示视频解码器。如图2所示,本申请实施例的方法包括:
S201、解码第一码流,确定量化后的第一特征信息。
其中,第一特征信息是对当前图像和当前图像的前一重建图像进行特征融合得到的。
本申请实施例提出一种基于神经网络的解码器,该基于神经网络的解码器与基于神经网络的编码器进行端到端训练得到。
本申请实施例中,当前图像的前一重建图像可以理解为视频序列中,位于当前图像之前的前一帧图像,该前一帧图像已解码重建。
由于当前图像和当前图像的前一重建图像这两个相邻帧之间存在着很强的相似性,因此,编码端在编码时,将当前图像和当前图像的前一重建图像进行特征融合,得到第一特征信息。例如,编码端将当前图像和当前图像的前一重建图像进行级联,将级联后的图像进行特征提取,得到第一特征信息。示例性的,编码端通过特征提取模块对级联后的图像进行特征提取,得到该第一特征信息。本申请对特征提取模块的具体网络结构不做限制。上述得到的第一特征信息为浮点型,例如为32位浮点数表示,进一步的,为了降低编码代价,则编码端对上述得到的第一特征信息进行量化,得到量化后的第一特征信息。接着,对量化后的第一特征信息进行编码,得到第一码流,例如,编码端对第一特征信息进行算数编码,得到第一码流。这样,解码端得到第一码流后,对该第一码流进行解码,得到量化后的第一特征信息,并根据该量化后的第一特征信息,得到当前图像的重建图像。
本申请实施例中,对上述S201中解码端解码第一码流,确定量化后的第一特征信息的方式包括但不限于如下几种:
方式一,若编码端直接使用量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。对应的,则解码端直接对第一码流进行解码,得到量化后的第一特征信息。
上述量化后的第一特征信息所包括冗余信息量较多,直接对量化后的第一特征信息进行编码时,编码所需的码字多,编码代价大。为了降低编码代价,在一些实施例中,编码端根据第一特征信息进行特征变换,得到第二特征信息,并对第二特征信息进行量化后再编码,得到第二码流;对该第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布;进而根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。也就是说,为了降低编码代价,则编码端确定第一特征信息对应的超先验特征信息,即第二特征信息,并基于该第二特征信息确定量化后的第一特征信息的概率分布,由于第二特征信息为第一特征信息的超先验特征信息,所包括的冗余量较少,这样基于该冗余量较少的第二特征信息确定量化后的第一 特征信息的概率分布,并使用该概率分布对第一特征信息进行编码,可以降低第一特征信息的编码代价。
基于上述描述,解码端可以通过如下方式二的步骤,确定量化后的第一特征信息。
方式二,上述S201包括如下S201-A至S201-C的步骤:
S201-A、解码第二码流,得到量化后的第二特征信息。
其中,第二特征信息是对第一特征信息进行特征变换得到的。
由上述可知,编码端为了降低编码代价,对第一特征信息进行特征变换,得到该第一特征信息的超先验特征信息,即第二特征信息,使用该第二特征信息确定量化后的第一特征信息的概率分布,并使用该概率分布对量化后的第一特征信息进行编码,得到第一码流。同时,为了使解码端采用与编码相同的概率分布对第一码流进行解码,则对上述第二特征信息进行编码,得到第二码流。也就是说,在该方式二中,编码端生成两个码流,分别为第一码流和第二码流。
这样解码端得到第一码流和第二码流后,首先解码第二码流,确定量化后的第一特征信息的概率分布,具体是,解码第二码流,得到量化后的第二特性信息,根据该量化后的第二特征信息,确定量化后的第一特征信息的概率分布。接着,解码端使用确定出的概率分布对第一码流进行解码,得到量化后的第一特征信息,进而实现对第一特征信息的准确解码。
本申请中,由于第二特征信息为第一特征信息的超先验特征信息,所包括的冗余信息较少,因此,编码端在编码时,可以直接使用量化后的第二特征信息的概率分布,对量化后的第二特征信息进行编码,得到第二码流。对应的,解码端在解码时,直接对该第二码流进行解码,即可得到量化后的第二特征信息。
S201-B、根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布。
解码端根据上述步骤,确定出量化后的第二特征信息后,根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布。
本申请实施例,对上述S201-B中根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布的具体方式不做限制。
在一些实施例中,由于上述第二特征信息是对第一特征信息进行特征变换得到的,基于此,S201-B包括如下S201-B1至S201-B3的步骤:
S201-B1、对量化后的第二特征信息进行反变换,得到重建特征信息。
在该实现方式中,解码端对量化后的第二特征信息进行反变换,得到重建特征信息,其中,解码端所采用的反变换方式可以理解为编码端采用的变换方式的逆运算。例如,编码端对第一特征信息进行N次特征提取,得到第二特征信息,对应的,解码端对量化后的第二特征信息进行N次反向的特征提取,得到反变换后的特征信息,记为重建特征信息。
本申请实施例对解码端采用反变换方式不做限制。
在一些实施例中,解码端采用的反变换方式包括N次特征提取。也就是说,解码端对得到的量化后的第二特征信息进行N次特征提取,得到重建特征信息。
在一些实施例中,解码端采用的反变换方式包括N次特征提取和N次上采样。也就是说,解码端对得到的量化后的第二特征信息进行N次特征提取和N次上采样,得到重建特征信息。
本申请实施例对上述N次特征提取和N次上采样的具体执行顺序不做限制。
在一种示例中,解码端可以先对量化后的第二特征信息进行N次连续的特征提取后,再进行N次连续的上采样。
在另一种示例中,上述N次特征提取和N次上采样穿插进行,即执行一次特征提取后执行一次上采样。举例说明,假设N=2,则解码端对量化后的第二特征信息进行反变换,得到重建特征信息的具体过程是:将量化后的第二特征信息输入第一个特征提取模块中进行第一次特征提取,得到特征信息1,对特征信息1进行上采样,得到特征信息2,将特征信息2输入第二个特征提取模块中进行第二次特征提取,得到特征信息3,对特征信息3进行上采样,得到特征信息4,将该特征信息4记为重建特征信息。
需要说明的是,本申请实施例对解码端所采用的N次特征提取方式不做限制,例如包括多层卷积、残差连接、密集连接等特征提取方式中的至少一种。
在一些实施例中,解码端通过非局部注意力方式来进行特征提取,此时,上述S201-B1包括如下S201-B11的步骤:
S201-B11、对量化后的第二特征信息进行N次非局部注意力变换和N次上采样,得到重建特征信息,N为正整数。
由于非局部注意力方式可以实现更高效的特征提取,能使得提取的特征保留更多的信息,且计算效率高,因此,本申请实施例中,解码端采用非局部注意力的方式对量化后的第二特征信息进行特征提取,以实现对量化后的第二特征信息的快速和准确特征提取。另外,编码端在根据第一特征信息生成第二特征信息时,进行了N次下采样,因此,解码端对应的执行N次上采样,以使重建得到的重建特征信息与第一特征信息的大小一致。
在一些实施例中,如图3所示,解码端通过反变换模块得到重建特征信息,该反变换模块包括N个非局部注意力模块和N个上采样模块。其中,非局部注意力模块用于实现非局部注意力变换,上采样模块用于实现上采样。示例性的,如图3所示,一个非局部注意力模块后,连接一个上采样模块。在实际应用时,解码端将解码得到的量化后的第二特征信息输入反变换模块中,反变换模块中的第一个非局部注意力模块对量化后的第二特征信息进行非局部注意力特征变换提取,得到特征信息1,再将特征信息1输入第一个上采样模块进行上采样,得到特征信息2。接着,将特征信息2输入第二个非局部注意力模块进行非局部注意力特征变换提取,得到特征信息3,再将特征信息3输入第二个上采样模块进行上采样,得到特征信息4。依次类推,得到第N个上采样模块输出的特征信息,并将该特征信息确定为重建特征信息。
S201-B2、确定重建特征信息的概率分布。
由上述可知,第二量化特征信息是对第一特征信息进行变换得到的,解码端通过上述步骤,对量化后的第二特征信息进行反量化,得到重建特征信息,因此,该重建特征信息可以理解为第一特征信息的重建信息,也就是说,重建特征信息的概率分布与量化后的第一特征信息的概率分布相似或相关,这样,解码端可以先确定出重建特征信息的概率分布,进而根据该重建特征信息的概率分布,预测量化后的所述第一特征信息的概率分布。
在一些实施例中,重建特征信息的概率分布为正态分布或高斯分布,此时,确定重建特征信息的概率分布的过程为,根据重建特征信息中的各特征值,确定该重建特征信息的均值和方差矩阵,根据均值和方差矩阵,生成该重建特征信息的高斯分布。
S201-B3、根据重建特征信息的概率分布,预测得到量化后的第一特征信息的概率分布。
由于重建特征信息为第一特征信息的重建信息,重建特征信息的概率分布与量化后的第一特征信息的概率分布相似或相关,因此,本申请实施例通过该重建特征信息的概率分布,可以实现对量化后的第一特征信息的概率分布的准确预测。
本申请实施例对上述S201-B3的具体实现方式不做限制。
在一种可能的实现方式中,将重建特征信息的概率分布,确定为量化后的第一特征信息的概率分布。
在另一种可能的实现方式中,根据重建特征信息的概率分布,预测量化后的第一特征信息中编码像素的概率;根据量化后的第一特征信息中编码像素的概率,得到量化后的第一特征信息的概率分布。
S201-C、根据量化后的第一特征信息的概率分布,对第一码流进行解码,得到量化后的第一特征信息。
根据上述步骤,确定出量化后的第一特征信息的概率分布后,使用该概率分布对第一码流进行解码,进而实现对量化后的第一特征信息的准确解码。
本申请实施例中,解码端根据上述方式一或方式二,解码第一码流,确定出量化后的第一特征信息后,执行如下S202的步骤。
S202、对量化后的第一特征信息进行多级时域融合,得到混合时空表征。
本申请实施例中,为了提高重建图像的准确性,对量化后的第一特征信息进行多级的时域融合,即将量化后的第一特征信息不仅与当前图像的前一重建图像的特征信息进行融合,并且将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,例如将t-1时刻、t-2时刻…、t-k时刻等多个时刻的重建图像与量化后的第一特征信息进行融合。这样可以避免当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征实现对前一重建图像进行运动补偿生成当前图像的P个预测图像时,可以提高生成的预测图像的准确性,进而基于该准确的预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
本申请实施例对解码端对量化后的第一特征信息进行多级时域融合,得到混合时空表征的具体方式不做限制。
在一些实施例中,解码端通过递归聚合模块混合时空表征,即上述S202包括如下S202-A的步骤:
S202-A、解码端通过递归聚合模块将量化后的第一特征信息,与前一时刻递归聚合模块的隐式特征信息进行融合,得到混合时空表征。
本申请实施例的递归聚合模块在每次生成混合时空表示时,会学习且保留从本次特征信息中所学习到的深层次特征信息,且将学习到的深层次特征作为隐式特征信息作用于下一次的混合时空表征生成,进而提高生成的混合时空表征的准确性。也就是说,本申请实施例中,前一时刻递归聚合模块的隐式特征信息包括了递归聚合模块所学习到的当前图像之前的多张重建图像的特征信息,这样,解码端通过递归聚合模块将量化后的第一特征信息,与前一时刻递归聚合模块的隐式特征信息进行融合,可以生成更加准确、丰富和详细的混合时空表征。
本申请实施例对递归聚合模块的具体网络结构不做限制,例如为可以实现上述功能的任意网络结构。
在一些实施例中，递归聚合模块由至少一个时空递归网络ST-LSTM堆叠而成，此时，上述混合时空表征 G_t 的表达公式如公式(1)所示：
G_t = ST-LSTM(X̂_F, h)   (1)
其中，X̂_F 为量化后的第一特征信息，h 为ST-LSTM所包括的隐式特征信息。
在一种示例中，假设递归聚合模块包括2个ST-LSTM组成，如图4所示，解码端将重建得到的量化后的第一特征信息 X̂_F 输入递归聚合模块中，递归聚合模块中的2个ST-LSTM依次对量化后的第一特征信息 X̂_F 进行处理，生成一特征信息，具体的，如图4所示，第一个ST-LSTM生成的隐式特征信息h1作为下一个ST-LSTM的输入，且两个ST-LSTM在本次运算过程中分别生成传输带的更新值c1和c2以对各自的传输带值进行更新，其中m为记忆信息，在两个ST-LSTM之间进行传递，最终得到第二个ST-LSTM输出的特征信息h2。进一步的，为了提高生成的混合时空表征的准确性，则将第二个ST-LSTM生成的特征信息h2与量化后的第一特征信息 X̂_F 进行残差连接，即将第二个ST-LSTM生成的特征信息h2与量化后的第一特征信息 X̂_F 进行相加，生成混合时空表征 G_t。
解码端根据上述方法,得到混合时空表征后,执行如下S203。
S203、根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像。
其中,P为正整数。
由上述可知,本申请实施例的混合时空表征融合的当前图像以及当前图像之前的多个重建图像的特征信息,这样根据该混合时空表征对前一重建图像进行运动补偿,可以得到精确的当前图像的P个预测图像。
本申请实施例对生成的P个预测图像的具体数量不做限制。即本申请实施例中,解码端可以采用不同的方式,根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像。
本申请实施例对上述解码端根据混合时空表征对前一重建图像进行运动补偿的具体的方式不做限制。
在一些实施例中,上述P个预测图像中包括第一预测图像,该第一预测图像是解码端采用光流运动补偿方式得到的,此时,上述S203包括如下S203-A1和S203-A2的步骤:
S203-A1、根据混合时空表征,确定光流运动信息;
S203-A2、根据光流运动信息对前一重建图像进行运动补偿,得到第一预测图像。
本申请实施例对解码端根据混合时空表征,确定光流运动信息的具体方式不做限制。
在一些实施例中,解码端通过预先训练好的神经网络模型得到光流运动信息,即该神经网络模型可以基于混合时空表征,预测出光流运动信息。在一些实施例中,该神经网络模型可以称为第一解码器,或光流信号解码器Df。解码端将混合时空表征Gt输入该光流信号解码器Df中进行光流运动信息的预测,得到该光流信号解码器Df输出的光流运动信息f x,y。可选的,该f x,y为通道为2的光流运动信息。
示例性的，f_{x,y} 的生成公式如公式(2)所示：
f_{x,y} = D_f(G_t)   (2)
本申请实施例对上述光流信号解码器Df的具体网络结构不做限制。
在一些实施例中,光流信号解码器Df由多个NLAM和多个上采样模块组成,示例性的,如图5所示,光流信号解码器Df包括1个NLAM、3个LAM和4个下采样模块,其中一个NLAM之后连接一个下采样模块,且一个LAM之后连接一个下采样模块。可选的,NLAM包括多个卷积层,例如包括3个卷积层,每个卷积层的卷积核大小为3*3,通道数为192。可选的,3个LAM分别包括多个卷积层,例如分别包括3个卷积层,每个卷积层的卷积核大小为3*3,3个LAM所包括的卷积层的通道数依次为128、96和64。可选的,4个下采样模块分别包括一个卷积层Conv,该卷积层的卷积核大小为5*5,4个下采样模块所包括的卷积层的通道数依次为128、96、64和2。这样,解码端将混合时空表征Gt输入该光流信号解码器Df中,NLAM对该时空表征Gt进行特征提取,得到一个通道数为192的特征信息a,并将该特征信息a输入第一个下采样模块中进行下采样,得到通道数为128的特征信息b。接着,将特征信息b输入第一个LAM中进行特征再提取,得到通道数为128的特征信息c,并将该特征信息c输入第二个下采样模块中进行下采样,得到通道数为96的特征信息d。接着,将特征信息d输入第二个LAM中进行特征再提取,得到通道数为96的特征信息e,并将该特征信息e输入第三个下采样模块中进行下采样,得到通道数为64的特征信息f。接着,将特征信息f输入第三个LAM中进行特征再提取,得到通道数为64的特征信息g,并将该特征信息g输入第四个下采样模块中进行下采样,得到通道数为2的特征信息j,特征信息j即为光流运动信息。
需要说明的是,上述图5只是一种示例中,且图5中各参数的设定也仅为示例,本申请实施例的光流信号解码器Df的网络结构包括但不限于图5所示。
解码端生成光流运动信息 f_{x,y} 后，使用光流运动信息 f_{x,y} 对前一重建图像 X̂_{t-1} 进行运动补偿，得到第一预测图像 X_1。
本申请实施例对解码端根据光流运动信息对前一重建图像进行运动补偿，得到第一预测图像的具体方式不做限制，例如，解码端使用光流运动信息 f_{x,y} 对前一重建图像 X̂_{t-1} 进行线性插值，将插值生成的图像记为第一预测图像 X_1。
在一种可能的实现方式中，解码端通过如下公式(3)，得到第一预测图像 X_1：
X_1 = Warp(X̂_{t-1}, f_{x,y})   (3)
在该实现方式中，如图5所示，解码端通过Warping(扭曲)运算，使用光流运动信息 f_{x,y} 对前一重建图像 X̂_{t-1} 进行运动补偿，得到第一预测图像 X_1。
在一些实施例中,上述P个预测图像中包括第二预测图像,该第二预测图像是解码端采用偏移运动补偿方式得到的,此时,上述S203包括如下S203-B1至S203-B3的步骤:
S203-B1、根据混合时空表征,得到当前图像对应的偏移量;
S203-B2、对前一重建图像进行空间特征提取,得到参考特征信息;
S203-B3、使用偏移量对参考特征信息进行运动补偿,得到第二预测图像。
本申请实施例对解码端根据混合时空表征,得到当前图像对应的偏移量的具体方式不做限制。
在一些实施例中,解码端通过预先训练好的神经网络模型得到当前图像对应的偏移量,即该神经网络模型可以基于混合时空表征,预测出偏移量,该偏移量为有损的偏移量信息。在一些实施例中,该神经网络模型可以称为第二解码器,或可变卷积解码器Dm。解码端将混合时空表征Gt输入该可变卷积解码器Dm中进行偏移量信息的预测。
同时,解码端对前一重建图像进行空间特征提取,得到参考特征信息。例如,解码端通过空间特征提取模块SFE对前一重建图像进行空间特征提取,得到参考特征信息。
接着,解码端使用偏移量对提取得到的参考特征信息进行运动补偿,得到当前图像的第二预测图像。
本申请实施例对解码端使用偏移量对提取得到的参考特征信息进行运动补偿,得到当前图像的第二预测图像的具体方式不做限制。
在一种可能的实现方式中,解码端使用偏移量,对参考特征信息进行基于可变形卷积的运动补偿,得到第二预测图像。
在一些实施例中,由于可变换卷积可以基于混合时空表征,生成当前图像对应的偏移量,因此,本申请实施例中,解码端将混合时空表征Gt,以及参考特征信息输入该可变换卷积中,该可变换卷积基于混合时空表征Gt生成当前图像对应的偏移量,且将该偏移量作用在参考特征信息上进行运动补偿,进而得到第二预测图像。
基于此，示例性的，如图6所示，本申请实施例的可变卷积解码器Dm包括可变换卷积DCN，解码端将前一重建图像 X̂_{t-1} 输入空间特征提取模块SFE中进行时空特征提取，得到参考特征信息。接着，将混合时空表征 G_t，以及参考特征信息输入可变换卷积DCN中进行偏移量的提取以及运动补偿，得到第二预测图像 X_2。
示例性的，解码端通过如公式(4)生成第二预测图像 X_2：
X_2 = DCN(SFE(X̂_{t-1}), G_t)   (4)
本申请实施例对上述光流信号解码器Df的具体网络结构不做限制。
在一些实施例中,如图6所示,为了进一步提高第二预测图像的准确性,则可变卷积解码器Dm除了包括可变换卷积DCN外,还包括1个NLAM、3个LAM和4个下采样模块,其中一个NLAM之后连接一个下采样模块,且一个LAM之后连接一个下采样模块。可选的,可变卷积解码器Dm所包括的1个NLAM、3个LAM和前3个下采样模块的网络结构与上述光流信号解码器Df所包括的1个NLAM、3个LAM和前3个下采样模块的网络结构相同,在此不再赘述。可选的,可变卷积解码器Dm包括的最后一个下采样模块所包括的通道数为5。
需要说明的是,上述图6只是一种示例中,且图6中各参数的设定也仅为示例,本申请实施例的可变卷积解码器Dm的网络结构包括但不限于图6所示。
本申请实施例中，如图6所示，解码端首先将前一重建图像 X̂_{t-1} 输入空间特征提取模块SFE中进行时空特征提取，得到参考特征信息。接着，将混合时空表征 G_t，以及参考特征信息输入可变卷积解码器Dm中的可变换卷积DCN中进行偏移量的提取以及运动补偿，得到一个特征信息，将该特征信息输入NLAM中，经过NLAM、3个LAM以及4个下采样模块的特征提取，最终还原为第二预测图像 X_2。
根据上述方法,解码端可以确定出P个预测图像,例如确定出第一预测图像和第二预测图像,接着,执行如下S204的步骤。
S204、根据P个预测图像,确定当前图像的重建图像。
在一些实施例中,若上述P个预测图像包括一个预测图像时,则根据该预测图像,确定当前图像的重建图像。
例如,将该预测图像与当前图像的前一个或几个重建图像进行比较,计算损失,若该损失小,则说明该预测图像的预测精度较高,可以将该预测图像确定为当前图像的重建图像。
再例如,若上述损失大,则说明该预测图像的预测精度较低,此时,可以根据当前图像的前一个或几个重建图像和该预测图像,确定当前图像的重建图像,例如,将该预测图像和当前图像的前一个或几个重建图像输入一神经网络中,得到当前图像的重建图像。
在一些实施例中,上述S204包括如下S204-A和S204-B的步骤:
S204-A、根据P个预测图像,确定当前图像的目标预测图像。
在该实现方式中,解码端首先根据P个预测图像,确定当前图像的目标预测图像,接着,根据该当前图像的目标预测图像实现当前图像的重建图像,进而提高重建图像的确定准确性。
本申请实施例对根据P个预测图像,确定当前图像的目标预测图像的具体方式不做限制。
在一些实施例中,若P=1,则将该一个预测图像确定为当前图像的目标预测图像。
在一些实施例中,若P大于1,则S204-A包括S204-A11和S204-A12:
S204-A11、根据P个预测图像,确定加权图像;
在该实现方式中,若根据上述方法,生成当前图像的多个预测图像,例如生成第一预测图像和第二预测图像时,则对这P个预测图像进行加权,生成加权图像,则根据该加权图像,得到目标预测图像。
本申请实施例对根据P个预测图像,确定加权图像的具体方式不做限制。
例如,确定P个预测图像对应的权重;并根据P个预测图像对应的权重,对P个预测图像进行加权,得到加权图像。
示例性的,若P个预测图像包括第一预测图像和第二预测图像,则解码端确定第一预测图像对应的第一权重和第二预测图像对应的第二权重,根据第一权重和所述第二权重,对第一预测图像和第二预测图像进行加权,得到加权图像。
其中,确定P个预测图像对应的权重的方式包括但不限于如下几种:
方式一,上述P个预测图像对应的权重为预设权重。假设P=2,即第一预测图像对应的第一权重和第二预测图像对应的第二权重可以是,第一权重等于第二权重,或者第一权重与第二权重的比值为1/2、1/4、1/2、1/3、2/1、3/1、4/1等等。
方式二,解码端根据混合时空表征进行自适应掩膜,得到P个预测图像对应的权重。
示例性的,解码端通过神经网络模型,生成P个预测图像对应的权重,该神经网络模型为预先训练好的,可以用于生成P个预测图像对应的权重。在一些实施例中,该神经网络模型也称为第三解码器或自适应掩膜补偿解码器D w。具体的,解码端将混合时空表征输入该自适应掩膜补偿解码器D w中进行自适应掩膜,得到P个预测图像对应的权重。例如,解码端将混合时空表征Gt输入该自适应掩膜补偿解码器D w中进行自适应掩膜,自适应掩膜补偿解码器D w输出第一预测图像的第一权重w1和第二预测图像的第二权重w2,进行根据第一权重w1和第二权重w2对上述得到第一预测图像X 1和第二预测图像X 2,能自适应地选择相应代表预测帧中不同区域地信息,进而生成加权图像。
示例性的，根据如下公式(5)生成加权图像 X_3：
X_3 = w_1 * X_1 + w_2 * X_2   (5)
在一些实施例中,上述P个预测图像对应的权重为一个矩阵,包括了预测图像中每个像素点对应的权重,这样在生成加权图像时,针对当前图像中的每个像素点,将P个预测图像中该像素点分别对应的预测值及其权重进行加权运算,得到该像素点的加权预测值,这样当前图像中每个像素点对应的加权预测值组成当前图像的加权图像。
本申请实施例对上述自适应掩膜补偿解码器D w的具体网络结构不做限制。
在一些实施例中,如图7所示,自适应掩膜补偿解码器D w包括1个NLAM、3个LAM、4个下采样模块和一个sigmoid函数,其中一个NLAM之后连接一个下采样模块,一个LAM之后连接一个下采样模块。可选的,自适应掩 膜补偿解码器D w所包括的1个NLAM、3个LAM、4个下采样模块与上述可变卷积解码器Dm所包括的1个NLAM、3个LAM、4个下采样模块的网络结构一致,在此不再赘述。
需要说明的是,上述图7只是一种示例中,且图7中各参数的设定也仅为示例,本申请实施例的自适应掩膜补偿解码器D w的网络结构包括但不限于图7所示。
在该实现方式中,解码端根据上述方法,对P个预测图像进行加权,得到加权图像后,执行如下S204-A12。
S204-A12、根据加权图像,得到目标预测图像。
例如,将该加权图像,确定为目标预测图像。
在一些实施例中,解码端还可以根据混合时空表征,得到当前图像的残差图像。
示例性的，解码端通过神经网络模型，得到当前图像的残差图像，该神经网络模型为预先训练好的，可以用于生成当前图像的残差图像。在一些实施例中，该神经网络模型也称为第四解码器或空间纹理增强解码器Dt。具体的，解码端将混合时空表征输入该空间纹理增强解码器Dt中进行空间纹理增强，得到当前图像的残差图像 X_r = D_t(G_t)，该残差图像 X_r 可以对预测图像进行纹理增强。
本申请实施例中,对上述空间纹理增强解码器Dt的具体网络结构不做限制。
在一些实施例中,如图8所示,空间纹理增强解码器Dt包括1个NLAM、3个LAM、4个下采样模块,其中一个NLAM之后连接一个下采样模块,一个LAM之后连接一个下采样模块。可选的,空间纹理增强解码器Dt所包括的1个NLAM、3个LAM、前3个下采样模块与上述光流信号解码器Df所包括的1个NLAM、3个LAM、前3个下采样模块的网络结构一致,在此不再赘述。空间纹理增强解码器Dt包括的最后一个下采样模块包括的通道数为3。
需要说明的是,上述图8只是一种示例中,且图8中各参数的设定也仅为示例,本申请实施例的空间纹理增强解码器Dt的网络结构包括但不限于图8所示。
由于上述残差图像X r可以对预测图像进行纹理增强。基于此,在一些实施例中,上述S204-A中根据P个预测图像,确定当前图像的目标预测图像包括如下S204-A21的步骤:
S204-A21、根据P个预测图像和残差图像,得到目标预测图像。
例如,若P=1,则根据该预测图像和残差图像,得到目标预测图像,例如,将该预测图像与残差图像进行相加,生成目标预测图像。
再例如,若P大于1时,则首先根据P个预测图像,确定加权图像;再根据加权图像和残差图像,确定目标预测图像。
其中,解码端根据P个预测图像,确定加权图像的具体过程可以参照上述S204-A11的具体描述,在此不再赘述。
举例说明,以P=2为例,根据上述方法,确定出第一预测图像对应的第一权重w1和第二预测图像对应的第二权重w2,可选的,根据上述公式(5)对第一预测图像和第二预测图像进行加权,得到加权图像X 3,接着,使用残差图像X r对加权图像X 3进行增强,得到目标预测图像。
示例性的，根据如下公式(6)生成目标预测图像 X_4：
X_4 = X_3 + X_r   (6)
根据上述方法,解码端确定出当前图像的目标预测图像后,执行如下S204-B的步骤。
S204-B、根据目标预测图像,确定当前图像的重建图像。
在一些实施例中,将该目标预测图像与当前图像的前一个或几个重建图像进行比较,计算损失,若该损失小,则说明该目标预测图像的预测精度较高,可以将该目标预测图像确定为当前图像的重建图像。若上述损失大,则说明该目标预测图像的预测精度较低,此时,可以根据当前图像的前一个或几个重建图像和该目标预测图像,确定当前图像的重建图像,例如,将该目标预测图像和当前图像的前一个或几个重建图像输入一神经网络中,得到当前图像的重建图像。
在一些实施例中,为了进一步提高重建图像的确定准确性,则本申请实施例还包括残差解码,此时,上述S204-B包括如下S204-B1和S204-B2的步骤:
S204-B1、对残差码流进行解码,得到当前图像的残差值;
S204-B2、根据目标预测图像和残差值,得到重建图像。
本申请实施例中,为了提高重建图像的效果,则编码端还通过残差编码的方式,生成残差码流,具体是,编码端确定当前图像的残差值,对该残差值进行编码生成残差码流。对应的,解码端对残差码流进行解码,得到当前图像的残差值,并根据目标预测图像和残差值,得到重建图像。
本申请实施例对上述当前图像的残差值的具体表示形式不做限制。
在一种可能的实现方式中,当前图像的残差值为一个矩阵,该矩阵中的每个元素为当前图像中每个像素点对应的残差值。这样,解码端可以逐像素的,将目标预测图像中每个像素点对应的残差值和预测值进行相加,得到每个像素点的重建值,进而得到当前图像的重建图像。以当前图像中的第i个像素点为例,在目标预测图像中,得到该第i个像素点对应的预测值,以及从当前图像的残差值中得到该第i个像素点对应的残差值,接着,将该第i个像素点对应的预测值和残差值进行相加,得到该第i个像素点对应的重建值。针对当前图像中的每个像素点,参照上述第i个像素点,可以得到当前图像中每个像素点对应的重建值,当前图像中每个像素点对应的重建值,组成当前图像的重建图像。
本申请实施例对解码端得到当前图像的残差值的具体方式不做限制,也就是说,本申请实施例对编解码两端所采用的残差编解码的方式不做限制。
在一种示例中,编码端根据上述与解码端相同的方式,确定出当前图像的目标预测图像,接着,根据当前图像和目标预测图像,得到当前图像的残差值,例如,将当前图像和目标预测图像的差值确定为当前图像的残差值。接着,对当前图像的残差值进行编码,生成残差编码。可选的,可以对当前图像的残差值进行变换,得到变换系数,对变换 系数进行量化得到量化系数,对量化系数进行编码,得到残差码流。对应的,解码端解码残差码流,得到当前图像的残差值,例如解码残差码流,得到量化系数,对量化系数进行反量化和反变换,得到当前图像的残差值。接着,再根据上述方法,将目标预测图像和当前图像对应的残差值进行相加,得到当前图像的重建图像。
在一些实施例中,编码端可以采用神经网络的方法,对当前图像和当前图像的目标预测图像进行处理,生成当前图像的残差值,并对当前图像的残差值进行编码,生成残差码流。对应的,解码端解码该残差码流,得到当前图像的残差值,接着,再根据上述方法,将目标预测图像和当前图像对应的残差值进行相加,得到当前图像的重建图像。
本申请实施例中,解码端根据上述方法,可以得到当前图像的重建图像。
可选的,可以将该重建图像进行直接显示。
可选的,还可以将该重建图像存入缓存中,用于后续图像的解码。
本申请实施例提供的视频解码方法,解码端通过解码第一码流,确定量化后的第一特征信息,第一特征信息是对当前图像和当前图像的前一重建图像进行特征融合得到的;对量化后的第一特征信息进行多级时域融合,得到混合时空表征;根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像,P为正整数;根据P个预测图像,确定当前图像的重建图像。本申请,为了提高重建图像的准确性,通过对量化后的第一特征信息进行多级时域融合,即将量化后的第一特征信息不仅与当前图像的前一重建图像的特征信息进行融合,并且将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,这样可以避免当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征对前一重建图像进行运动补偿时,可以生成高精度的P个预测图像时,基于该高精度的P个预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
本申请实施例中,提出一种端到端的基于神经网络的编解码框架,该基于神经网络的编解码框架包括基于神经网络的编码器和基于神经网络的解码器。下面结合的本申请一种可能的基于神经网络的解码器,对本申请实施例的解码过程进行介绍。
图9为本申请一实施例涉及的一种基于神经网络的解码器的网络结构示意图,包括:反变换模块、递归聚合模块和混合运动补偿模块。
其中,反变换模块用于对量化后的第二特征信息进行反变换,得到第一特征信息的重建特征信息,示例性的,其网络结构如图3所示。
递归聚合模块用于对量化后的第一特征信息进行多级时域融合,得到混合时空表征,示例性的,其网络结构如图4所示。
混合运动补偿模块用于对混合时空表征进行混合运动补偿,得到当前图像的目标预测图像,示例性的,混合运动补偿模块可以包括图5所示的第一解码器、和/或图6所示的第二解码器,可选的,若混合运动补偿模块包括第一解码器和第二解码器时,则该混合运动补偿模块还可以包括图7所示的第三解码器。在一些实施例中,该混合运动补偿模块还可以包括如图8所示的第四解码器。
示例性的,本申请实施例以运动补偿模块包括第一解码器、第二解码器、第三解码器和第四解码器为例进行说明。
在上述图9所示的基于神经网络的解码器的基础上,结合图10对本申请实施例一种可能的视频解码方法进行介绍。
图10为本申请一实施例提供的视频解码流程示意图,如图10所示,包括:
S301、解码第二码流,得到量化后的第二特征信息。
上述S301的具体实现过程参照上述S201-A的描述,在此不再赘述。
S302、通过反变换模块对量化后的第二特征信息进行反变换,得到重建特征信息。
示例性的,该反变换模块的具体网络结构如图3所示,包括2个非局部自注意力模块和2个上采样模块。
例如,解码端将量化后的第二特征信息输入反变换模块进行反变换,该反变换模块输出重建特征信息。上述S302的具体实现过程参照上述S201-B1的描述,在此不再赘述。
S303、确定重建特征信息的概率分布。
S304、根据重建特征信息的概率分布,预测得到量化后的第一特征信息的概率分布。
S305、根据量化后的第一特征信息的概率分布,对第一码流进行解码,得到量化后的第一特征信息。
上述S303至S305的具体实现过程,参照上述S201-B2、S201-B3和S201-C的具体描述,在此不再赘述。
S306、通过递归聚合模块,对量化后的第一特征信息进行多级时域融合,得到混合时空表征。
可选的,递归聚合模块由至少一个时空递归网络堆叠而成。
示例性的,递归聚合模块的网络结构如图4所示。
例如,解码端将量化后的第一特征信息输入递归聚合模块,以使递归聚合模块将量化后的第一特征信息与前一时刻递归聚合模块的隐式特征信息进行融合,进而输出混合时空表征。上述S306的具体实现过程参照上述S202-A的描述,在此不再赘述。
S307、通过第一解码器对混合时空表征进行处理,得到第一预测图像。
根据上述S306得到混合时空表征后,将该混合时空表征和前一重建图像输入混合运动补偿模块进行运动混合补偿,得到当前图像的目标预测图像。
具体是,通过第一解码器对混合时空表征进行处理,确定光流运动信息,并根据光流运动信息对前一重建图像进行运动补偿,得到第一预测图像。
可选的,第一解码器的网络结构如图5所示。
上述S307的具体实现过程,参照上述S203-A1和S203-A2的具体描述,在此不再赘述。
S308、通过第二解码器对混合时空表征进行处理,得到第二预测图像。
具体是,通过SFE对前一重建图像进行空间特征提取,得到参考特征信息;将参考特征信和混合时空表征输入第二解码器,以使偏移量对参考特征信息进行运动补偿,得到第二预测图像。
可选的,第二解码器的网络结构如图6所示。
上述S308的具体实现过程,参照上述S203-B1至S203-B3的具体描述,在此不再赘述。
S309、通过第三解码器对混合时空表征进行处理,得到第一预测图像对应的第一权重和第二预测图像对应的第二权重。
具体是,将混合时空表征输入第三解码器进行自适应掩膜,得到第一预测图像对应的第一权重和第二预测图像对应的第二权重。
可选的,第三解码器的网络结构如图7所示。
上述S309的具体实现过程,参照上述S204-A11中方式二的具体描述,在此不再赘述。
S310、根据第一权重和第二权重,对第一预测图像和第二预测图像进行加权,得到加权图像。
例如,将第一权重与第一预测图像的乘积,与第二权重与第二预测图像的乘积相加,得到加权图像。
S311、通过第四解码器对混合时空表征进行处理,得到当前图像的残差图像。
具体是,将混合时空表征输入第四解码器进行处理,得到当前图像的残差图像。
可选的,第四解码器的网络结构如图8所示。
上述S311的具体实现过程,参照上述S204-A12的具体描述,在此不再赘述。
S312、根据加权图像和残差图像,确定目标预测图像。
例如,将加权图像和残差图像相加,确定为目标预测图像。
S313、对残差码流进行解码,得到当前图像的残差值。
S314、根据目标预测图像和残差值,得到重建图像。
上述S313和S314的具体实现过程,参照上述S204-B1和S204-B2的具体描述,在此不再赘述。
本申请实施例,通过图9所示的基于神经网络的解码器进行解码时,对量化后的第一特征信息进行多级时域融合,即将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征实现对前一重建图像进行运动补偿生成多个解码信息,例如该多个解码信息包括第一预测图像、第二预测图像、第一预测图像和第二预测图像分别对应的权重、以及残差图像,这样基于这多个解码信息确定当前图像的目标预测图像时,可以有效提高目标预测图像的准确性,进而基于该准确的预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
上文对本申请实施例涉及的视频解码方法进行了描述,在此基础上,下面针对编码端,对本申请涉及的视频编码方法进行描述。
图11为本申请实施例提供的视频编码方法的一种流程示意图。本申请实施例的执行主体可以为上述图1所示的编码器。
如图11所示,本申请实施例的方法包括:
S401、对当前图像以及当前图像的前一重建图像进行特征融合,得到第一特征信息。
本申请实施例提出一种基于神经网络的编码器,该基于神经网络的编码器与基于神经网络的解码器进行端到端训练得到。
本申请实施例中,当前图像的前一重建图像可以理解为视频序列中,位于当前图像之前的前一帧图像,该前一帧图像已解码重建。
由于当前图像 X_t 和当前图像的前一重建图像 X̂_{t-1} 这两个相邻帧之间存在着很强的相似性，因此，编码端在编码时，将当前图像 X_t 和当前图像的前一重建图像 X̂_{t-1} 进行特征融合，得到第一特征信息。例如，编码端将当前图像 X_t 和当前图像的前一重建图像 X̂_{t-1} 进行通道间的级联，通过 X_cat = Concat(X_t, X̂_{t-1}) 得到级联的输入数据 X_cat。X̂_{t-1} 和 X_t 为SRGB域的3通道视频帧输入，X_cat 采用逐个通道堆叠的方式将两帧视频合成得到通道数为6的输入信号。接着，对级联后的图像 X_cat 进行特征提取，得到第一特征信息。
本申请实施例对编码端对X cat进行特征提取的具体方式不做限制。例如包括多层卷积、残差连接、密集连接等特征提取方式中的至少一种。
在一些实施例中,编码端对级联后的图像进行Q次非局部注意力变换和Q次下采样,得到第一特征信息,Q为正整数。
例如,编码端将级联后的6通道高维输入信号X cat,输入时空特征提取模块(Spatiotemporal Feature Extraction,STFE)进行多层的特征变换和提取。
可选的,时空特征提取模块包括Q个非局部注意力模块和Q个下采样模块。其中,非局部注意力模块用于实现非局部注意力变换,下采样模块用于实现下采样。示例性的,如图12所示,一个非局部注意力模块后,连接一个下采样模块。在实际应用时,编码端将级联后的6通道高维输入信号X cat输入STFE中,STFE中的第一个非局部注意力模块对X cat进行非局部注意力特征变换提取,得到特征信息11,再将特征信息11输入第一个下采样模块进行下采样,得到特征信息12。接着,将特征信息12输入第二个非局部注意力模块进行非局部注意力特征变换提取,得到特征信息13,再将特征信息13输入第二个下采样模块进行下采样,得到特征信息14。依次类推,得到第Q个下采样模块输出的特征信息,并将该特征信息确定为第一特征信息X F
本申请实施例对Q的具体取值不做限制。
可选的,Q=4。
S402、对第一特征信息进行量化,得到量化后的第一特征信息。
上述得到的第一特征信息为浮点型,例如为32位浮点数表示,进一步的,为了降低编码代价,则编码端对上述得到的第一特征信息进行量化,得到量化后的第一特征信息。
示例性的,编码端采用四舍五入函数Round(.)对第一特征信息量化。
在一些实施例中，在模型训练过程中，对正向传播时，使用如下公式(7)所示的方法对第一特征信息进行量化：
X̂_F = X_F + U(-0.5, 0.5)   (7)
其中，U(-0.5,0.5)为正负0.5的均匀噪声分布，用于近似实际的四舍五入量化函数Round(.)。
在训练过程对公式(7)进行求导得到对应的反向传播梯度为1,并将其作为反向传播的梯度对模型进行更新。
S403、对量化后的第一特征信息进行编码,得到第一码流。
方式一,编码端直接使用量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流
上述量化后的第一特征信息所包括冗余信息量较多,直接对量化后的第一特征信息进行编码时,编码所需的码字多,编码代价大。为了降低编码代价,在一些实施例中,编码端根据第一特征信息进行特征变换,得到第二特征信息,并对第二特征信息进行量化后再编码,得到第二码流;对该第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布;进而根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。也就是说,为了降低编码代价,则编码端确定第一特征信息对应的超先验特征信息,即第二特征信息,并基于该第二特征信息确定量化后的第一特征信息的概率分布,由于第二特征信息为第一特征信息的超先验特征信息,所包括的冗余量较少,这样基于该冗余量较少的第二特征信息确定量化后的第一特征信息的概率分布,并使用该概率分布对第一特征信息进行编码,可以降低第一特征信息的编码代价。
基于上述描述,编码端可以通过如下方式二的步骤,对量化后的第一特征信息进行编码,得到第一码流。
方式二,上述S403包括如下S403-A1至S403-A4的步骤:
S403-A1、根据第一特征信息进行特征变换,得到第二特征信息。
在该方式二中,编码端为了降低编码代价,对第一特征信息进行特征变换,得到该第一特征信息的超先验特征信息,即第二特征信息,使用该第二特征信息确定量化后的第一特征信息的概率分布,并使用该概率分布对量化后的第一特征信息进行编码,得到第一码流。同时,为了使解码端采用与编码相同的概率分布对第一码流进行解码,则对上述第二特征信息进行编码,得到第二码流。也就是说,在该方式二中,编码端生成两个码流,分别为第一码流和第二码流。
本申请实施例中,编码端根据第一特征信息进行特征变换,得到第二特征信息的方式包括但不限于如下几种:
方式1,对第一特征信息进行N次非局部注意力变换和N次下采样,得到第二特征信息。
方式2,对量化后的第一特征信息进行N次非局部注意力变换和N次下采样,得到第二特征信息。
也就是说,编码端可以对第一特征信息或者量化后的第一特征信息进行N次非局部注意力变换和N次下采样,得到第二特征信息。
S403-A2、对第二特征信息进行量化后再编码,得到第二码流。
例如,对第二特征信息进行量化,得到量化后的第二特征信息;确定量化后的第二特征信息的概率分布;根据量化后的第二特征信息的概率分布,对量化后的第二特征信息进行编码,得到第二码流。
本申请中,由于第二特征信息为第一特征信息的超先验特征信息,所包括的冗余信息较少,因此,编码端在编码时,直接使用量化后的第二特征信息的概率分布,对量化后的第二特征信息进行编码,得到第二码流。
S403-A3、对第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布。
本申请实施例中,编码端对超先验的第二码流进行算数解码,还原得到量化后的超先验时空特征
Figure PCTCN2022090468-appb-000020
即量化后的第二特征信息,接着,根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布,进而根据量化后的第一特征信息的概率分布对量化后的第一特征信息进行编码,得到第一码流。
下面对上述S403-A3中根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布的过程进行介绍。
在一些实施例中,上述S403-A3中根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布包括如下步骤:
S403-A31、对量化后的第二特征信息进行反变换,得到重建特征信息。
在该实现方式中,编码端对量化后的第二特征信息进行反变换,得到重建特征信息,其中,编码端所采用的反变换方式可以理解为编码端采用的变换方式的逆运算。例如,编码端对第一特征信息进行N次特征提取,得到第二特征信息,对应的,此时编码端对量化后的第二特征信息进行N次反向的特征提取,得到反变换后的特征信息,记为重建特征信息。
本申请实施例对编码端采用反变换方式不做限制。
在一些实施例中,编码端采用的反变换方式包括N次特征提取。也就是说,编码端对得到的量化后的第二特征信息进行N次特征提取,得到重建特征信息。
在一些实施例中,编码端采用的反变换方式包括N次特征提取和N次上采样。也就是说,编码端对得到的量化后的第二特征信息进行N次特征提取和N次上采样,得到重建特征信息。
本申请实施例对上述N次特征提取和N次上采样的具体执行顺序不做限制。
在一种示例中,编码端可以先对量化后的第二特征信息进行N次连续的特征提取后,再进行N次连续的上采样。
在另一种示例中,上述N次特征提取和N次上采样穿插进行,即执行一次特征提取后执行一次上采样。
需要说明的是,本申请实施例对编码端所采用的N次特征提取方式不做限制,例如包括多层卷积、残差连接、密集连接等特征提取方式中的至少一种。
在一些实施例中,编码端对量化后的第二特征信息进行N次非局部注意力变换和N次上采样,得到重建特征信息,N为正整数。
由于非局部注意力方式可以实现更高效的特征提取,能使得提取的特征保留更多的信息,且计算效率高,因此,本申请实施例中,编码端采用非局部注意力的方式对量化后的第二特征信息进行特征提取,以实现对量化后的第二特征信息的快速和准确特征提取。另外,编码端在根据第一特征信息生成第二特征信息时,进行了N次下采样,因此,此时,在反变换时编码端对应的执行N次上采样,以使重建得到的重建特征信息与第一特征信息的大小一致。
在一些实施例中,如图3所示,编码端通过反变换模块得到重建特征信息,该反变换模块包括N个非局部注意力模块和N个上采样模块。
S403-A32、确定重建特征信息的概率分布。
由上述可知,第二量化特征信息是对第一特征信息进行变换得到的,编码端通过上述步骤,对量化后的第二特征信息进行反量化,得到重建特征信息,因此,该重建特征信息可以理解为第一特征信息的重建信息,也就是说,重建特征信息的概率分布与量化后的第一特征信息的概率分布相似或相关,这样,编码端可以先确定出重建特征信息的概率分布,进而根据该重建特征信息的概率分布,预测量化后的所述第一特征信息的概率分布。
在一些实施例中,重建特征信息的概率分布为正态分布或高斯分布,此时,确定重建特征信息的概率分布的过程为,根据重建特征信息中的各特征值,确定该重建特征信息的均值和方差矩阵,根据均值和方差矩阵,生成该重建特征信息的高斯分布。
S403-A33、根据重建特征信息的概率分布,确定量化后的第一特征信息的概率分布。
例如,根据重建特征信息的概率分布,预测量化后的第一特征信息中编码像素的概率;根据量化后的第一特征信息中编码像素的概率,得到量化后的第一特征信息的概率分布。
S403-A4、根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。
根据上述步骤,确定出量化后的第一特征信息的概率分布后,使用该概率分布对量化后的第一特征信息进行编码,得到第一码流。
在一些实施例中,本申请实施例还包括确定当前图像的重建图像的步骤,即本申请实施例还包括如下S404:
S404、确定当前图像的重建图像。
在一些实施例中,上述S404包括如下步骤:
S404-A、对量化后的第一特征信息进行多级时域融合,得到混合时空表征。
在一些实施例中,上述量化后的第一特征信息为编码端对第一特征信息进行量化后的特征信息。
在一些实施例中,上述量化后的第一特征信息为编码端重建后的,例如,编码端对第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布,示例性的,编码端根据上述S403-A31至S403-A33的方法,得到量化后的第一特征信息的概率分布,进而使用量化后的第一特征信息的概率分布对第一码流进行解码,得到量化后的第一特征信息。
接着,编码端对上述得到的量化后的第一特征信息进行多级时域融合,得到混合时空表征。
本申请实施例中,为了提高重建图像的准确性,对量化后的第一特征信息进行多级的时域融合,即将量化后的第一特征信息不仅与当前图像的前一重建图像的特征信息进行融合,并且将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,例如将t-1时刻、t-2时刻…、t-k时刻等多个时刻的重建图像与量化后的第一特征信息进行融合。这样可以避免当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征实现对前一重建图像进行运动补偿生成当前图像的P个预测图像时,可以提高生成的预测图像的准确性,进而基于该准确的预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
本申请实施例对编码端对量化后的第一特征信息进行多级时域融合,得到混合时空表征的具体方式不做限制。
在一些实施例中,编码端通过递归聚合模块混合时空表征,即上述S404-A包括如下S404-A1的步骤:
S404-A1、编码端通过递归聚合模块将量化后的第一特征信息,与前一时刻递归聚合模块的隐式特征信息进行融合,得到混合时空表征。
本申请实施例的递归聚合模块在每次生成混合时空表示时,会学习且保留从本次特征信息中所学习到的深层次特征信息,且将学习到的深层次特征作为隐式特征信息作用于下一次的混合时空表征生成,进而提高生成的混合时空表征的准确性。也就是说,本申请实施例中,前一时刻递归聚合模块的隐式特征信息包括了递归聚合模块所学习到的当前图像之前的多张重建图像的特征信息,这样,编码端通过递归聚合模块将量化后的第一特征信息,与前一时刻递归聚合模块的隐式特征信息进行融合,可以生成更加准确、丰富和详细的混合时空表征。
本申请实施例对递归聚合模块的具体网络结构不做限制,例如为可以实现上述功能的任意网络结构。
在一些实施例中,递归聚合模块由至少一个时空递归网络ST-LSTM堆叠而成,此时,上述混合时空表征Gt的表达公式如上述公式(1)所示。
S404-B、根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像,P为正整数。
由上述可知,本申请实施例的混合时空表征融合的当前图像以及当前图像之前的多个重建图像的特征信息,这样根据该混合时空表征对前一重建图像进行运动补偿,可以得到精确的当前图像的P个预测图像。
本申请实施例对生成的P个预测图像的具体数量不做限制。即本申请实施例中,编码端可以采用不同的方式,根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像。
本申请实施例对上述编码端根据混合时空表征对前一重建图像进行运动补偿的具体的方式不做限制。
在一些实施例中,上述P个预测图像中包括第一预测图像,该第一预测图像是编码端采用光流运动补偿方式得到 的,此时,上述S404-B包括如下S404-B1和S404-B2的步骤:
S404-B1、根据混合时空表征,确定光流运动信息;
S404-B2、根据光流运动信息对前一重建图像进行运动补偿,得到第一预测图像。
本申请实施例对编码端根据混合时空表征,确定光流运动信息的具体方式不做限制。
在一些实施例中,编码端通过预先训练好的神经网络模型得到光流运动信息,即该神经网络模型可以基于混合时空表征,预测出光流运动信息。在一些实施例中,该神经网络模型可以称为第一解码器,或光流信号解码器Df。编码端将混合时空表征Gt输入该光流信号解码器Df中进行光流运动信息的预测,得到该光流信号解码器Df输出的光流运动信息f x,y。可选的,该f x,y为通道为2的光流运动信息。
示例性的,f x,y的生成公式如上述公式(2)所示。
本申请实施例对上述光流信号解码器Df的具体网络结构不做限制。
在一些实施例中,光流信号解码器Df由多个NLAM和多个上采样模块组成,示例性的,如图5所示,光流信号解码器Df包括1个NLAM、3个LAM和4个下采样模块,其中一个NLAM之后连接一个下采样模块,且一个LAM之后连接一个下采样模块。
需要说明的是,上述图5只是一种示例中,且图5中各参数的设定也仅为示例,本申请实施例的光流信号解码器Df的网络结构包括但不限于图5所示。
编码端生成光流运动信息 f_{x,y} 后，使用光流运动信息 f_{x,y} 对前一重建图像 X̂_{t-1} 进行运动补偿，得到第一预测图像 X_1。
本申请实施例对编码端根据光流运动信息对前一重建图像进行运动补偿，得到第一预测图像的具体方式不做限制，例如，编码端使用光流运动信息 f_{x,y} 对前一重建图像 X̂_{t-1} 进行线性插值，将插值生成的图像记为第一预测图像 X_1。
在一种可能的实现方式中，编码端通过上述公式(3)，得到第一预测图像 X_1。
在该实现方式中，如图5所示，编码端通过Warping(扭曲)运算，使用光流运动信息 f_{x,y} 对前一重建图像 X̂_{t-1} 进行运动补偿，得到第一预测图像 X_1。
在一些实施例中,上述P个预测图像中包括第二预测图像,该第二预测图像是解码端采用偏移运动补偿方式得到的,此时,上述S404-B包括如下S404-B-1至S404-B-3的步骤:
S404-B-1、根据混合时空表征,得到当前图像对应的偏移量;
S404-B-2、对前一重建图像进行空间特征提取,得到参考特征信息;
S404-B-3、使用偏移量对参考特征信息进行运动补偿,得到第二预测图像。
本申请实施例对编码端根据混合时空表征,得到当前图像对应的偏移量的具体方式不做限制。
在一些实施例中,编码端通过预先训练好的神经网络模型得到当前图像对应的偏移量,即该神经网络模型可以基于混合时空表征,预测出偏移量,该偏移量为有损的偏移量信息。在一些实施例中,该神经网络模型可以称为第二解码器,或可变卷积解码器Dm。编码端将混合时空表征Gt输入该可变卷积解码器Dm中进行偏移量信息的预测。
同时,编码端对前一重建图像进行空间特征提取,得到参考特征信息。例如,编码端通过空间特征提取模块SFE对前一重建图像进行空间特征提取,得到参考特征信息。
接着,编码端使用偏移量对提取得到的参考特征信息进行运动补偿,得到当前图像的第二预测图像。
本申请实施例对编码端使用偏移量对提取得到的参考特征信息进行运动补偿,得到当前图像的第二预测图像的具体方式不做限制。
在一种可能的实现方式中,编码端使用偏移量,对参考特征信息进行基于可变形卷积的运动补偿,得到第二预测图像。
在一些实施例中,由于可变换卷积可以基于混合时空表征,生成当前图像对应的偏移量,因此,本申请实施例中,编码端将混合时空表征Gt,以及参考特征信息输入该可变换卷积中,该可变换卷积基于混合时空表征Gt生成当前图像对应的偏移量,且将该偏移量作用在参考特征信息上进行运动补偿,进而得到第二预测图像。
基于此，示例性的，如图6所示，本申请实施例的可变卷积解码器Dm包括可变换卷积DCN，编码端将前一重建图像 X̂_{t-1} 输入空间特征提取模块SFE中进行时空特征提取，得到参考特征信息。接着，将混合时空表征 G_t，以及参考特征信息输入可变换卷积DCN中进行偏移量的提取以及运动补偿，得到第二预测图像 X_2。
示例性的,编码端通过上述公式(4)生成第二预测图像X 2
本申请实施例对上述光流信号解码器Df的具体网络结构不做限制。
在一些实施例中,如图6所示,为了进一步提高第二预测图像的准确性,则可变卷积解码器Dm除了包括可变换卷积DCN外,还包括1个NLAM、3个LAM和4个下采样模块,其中一个NLAM之后连接一个下采样模块,且一个LAM之后连接一个下采样模块。
需要说明的是,上述图6只是一种示例中,且图6中各参数的设定也仅为示例,本申请实施例的可变卷积解码器Dm的网络结构包括但不限于图6所示。
本申请实施例中，如图6所示，编码端首先将前一重建图像 X̂_{t-1} 输入空间特征提取模块SFE中进行时空特征提取，得到参考特征信息。接着，将混合时空表征 G_t，以及参考特征信息输入可变卷积解码器Dm中的可变换卷积DCN中进行偏移量的提取以及运动补偿，得到一个特征信息，将该特征信息输入NLAM中，经过NLAM、3个LAM以及4个下采样模块的特征提取，最终还原为第二预测图像 X_2。
根据上述方法,编码端可以确定出P个预测图像,例如确定出第一预测图像和第二预测图像,接着,执行如下S204 的步骤。
S404-C、根据P个预测图像,确定所述当前图像的重建图像。
在一些实施例中,若上述P个预测图像包括一个预测图像时,则根据该预测图像,确定当前图像的重建图像。
例如,将该预测图像与当前图像的前一个或几个重建图像进行比较,计算损失,若该损失小,则说明该预测图像的预测精度较高,可以将该预测图像确定为当前图像的重建图像。
再例如,若上述损失大,则说明该预测图像的预测精度较低,此时,可以根据当前图像的前一个或几个重建图像和该预测图像,确定当前图像的重建图像,例如,将该预测图像和当前图像的前一个或几个重建图像输入一神经网络中,得到当前图像的重建图像。
在一些实施例中,上述S404-C包括如下S404-C-A和S404-C-B的步骤:
S404-C-A、根据P个预测图像,确定当前图像的目标预测图像。
在该实现方式中,编码端首先根据P个预测图像,确定当前图像的目标预测图像,接着,根据该当前图像的目标预测图像实现当前图像的重建图像,进而提高重建图像的确定准确性。
本申请实施例对根据P个预测图像,确定当前图像的目标预测图像的具体方式不做限制。
在一些实施例中,若P=1,则将该一个预测图像确定为当前图像的目标预测图像。
在一些实施例中,若P大于1,则S404-C-A包括S404-C-A11和S404-C-A12:
S404-C-A11、根据P个预测图像,确定加权图像;
在该实现方式中,若根据上述方法,生成当前图像的多个预测图像,例如生成第一预测图像和第二预测图像时,则对这P个预测图像进行加权,生成加权图像,则根据该加权图像,得到目标预测图像。
本申请实施例对根据P个预测图像,确定加权图像的具体方式不做限制。
例如,确定P个预测图像对应的权重;并根据P个预测图像对应的权重,对P个预测图像进行加权,得到加权图像。
示例性的,若P个预测图像包括第一预测图像和第二预测图像,则编码端确定第一预测图像对应的第一权重和第二预测图像对应的第二权重,根据第一权重和所述第二权重,对第一预测图像和第二预测图像进行加权,得到加权图像。
其中,确定P个预测图像对应的权重的方式包括但不限于如下几种:
方式一,上述P个预测图像对应的权重为预设权重。假设P=2,即第一预测图像对应的第一权重和第二预测图像对应的第二权重可以是,第一权重等于第二权重,或者第一权重与第二权重的比值为1/2、1/4、1/2、1/3、2/1、3/1、4/1等等。
方式二,编码端根据混合时空表征进行自适应掩膜,得到P个预测图像对应的权重。
示例性的,编码端通过神经网络模型,生成P个预测图像对应的权重,该神经网络模型为预先训练好的,可以用于生成P个预测图像对应的权重。在一些实施例中,该神经网络模型也称为第三解码器或自适应掩膜补偿解码器D w。具体的,编码端将混合时空表征输入该自适应掩膜补偿解码器D w中进行自适应掩膜,得到P个预测图像对应的权重。例如,编码端将混合时空表征Gt输入该自适应掩膜补偿解码器D w中进行自适应掩膜,自适应掩膜补偿解码器D w输出第一预测图像的第一权重w1和第二预测图像的第二权重w2,进行根据第一权重w1和第二权重w2对上述得到第一预测图像X 1和第二预测图像X 2,能自适应地选择相应代表预测帧中不同区域地信息,进而生成加权图像。
示例性的,根据上述公式(5)生成加权图像X 3
在一些实施例中,上述P个预测图像对应的权重为一个矩阵,包括了预测图像中每个像素点对应的权重,这样在生成加权图像时,针对当前图像中的每个像素点,将P个预测图像中该像素点分别对应的预测值及其权重进行加权运算,得到该像素点的加权预测值,这样当前图像中每个像素点对应的加权预测值组成当前图像的加权图像。
本申请实施例对上述自适应掩膜补偿解码器D w的具体网络结构不做限制。
在一些实施例中,如图7所示,自适应掩膜补偿解码器D w包括1个NLAM、3个LAM、4个下采样模块和一个sigmoid函数,其中一个NLAM之后连接一个下采样模块,一个LAM之后连接一个下采样模块。
需要说明的是,上述图7只是一种示例中,且图7中各参数的设定也仅为示例,本申请实施例的自适应掩膜补偿解码器D w的网络结构包括但不限于图7所示。
在该实现方式中,编码端根据上述方法,对P个预测图像进行加权,得到加权图像后,执行如下S404-C-A12。
S404-C-A12、根据加权图像,得到目标预测图像。
例如,将该加权图像,确定为目标预测图像。
在一些实施例中,编码端还可以根据混合时空表征,得到当前图像的残差图像。
示例性的,编码端通过神经网络模型,得到当前图像的残差图像,该神经网络模型为预先训练好的,可以用于生成当前图像的残差图像。在一些实施例中,该神经网络模型也称为第四解码器或空间纹理增强解码器Dt。具体的,编码端将混合时空表征输入该空间纹理增强解码器Dt中进行空间纹理增强,得到当前图像的残差图像X r=D_t(G t),该残差图像X r可以对预测图像进行纹理增强。
本申请实施例中,对上述空间纹理增强解码器Dt的具体网络结构不做限制。
在一些实施例中,如图8所示,空间纹理增强解码器Dt包括1个NLAM、3个LAM、4个下采样模块,其中一个NLAM之后连接一个下采样模块,一个LAM之后连接一个下采样模块。
需要说明的是,上述图8只是一种示例,且图8中各参数的设定也仅为示例,本申请实施例的空间纹理增强解码器Dt的网络结构包括但不限于图8所示。
上述残差图像X r可以对预测图像进行纹理增强。基于此,在一些实施例中,上述S404-C-A中根据P个预测图像,确定当前图像的目标预测图像包括如下S404-C-A21的步骤:
S404-C-A21、根据P个预测图像和残差图像,得到目标预测图像。
例如,若P=1,则根据该预测图像和残差图像,得到目标预测图像,例如,将该预测图像与残差图像进行相加,生成目标预测图像。
再例如,若P大于1时,则首先根据P个预测图像,确定加权图像;再根据加权图像和残差图像,确定目标预测图像。
其中,编码端根据P个预测图像,确定加权图像的具体过程可以参照上述S404-C-A11的具体描述,在此不再赘述。
举例说明,以P=2为例,根据上述方法,确定出第一预测图像对应的第一权重w1和第二预测图像对应的第二权重w2,可选的,根据上述公式(5)对第一预测图像和第二预测图像进行加权,得到加权图像X 3,接着,使用残差图像X r对加权图像X 3进行增强,得到目标预测图像。
示例性的,根据上述公式(6)生成目标预测图像X 4
根据上述方法,编码端确定出当前图像的目标预测图像后,执行如下S404-C-B的步骤。
S404-C-B、根据目标预测图像,确定当前图像的重建图像。
在一些实施例中,将该目标预测图像与当前图像的前一个或几个重建图像进行比较,计算损失,若该损失小,则说明该目标预测图像的预测精度较高,可以将该目标预测图像确定为当前图像的重建图像。若上述损失大,则说明该目标预测图像的预测精度较低,此时,可以根据当前图像的前一个或几个重建图像和该目标预测图像,确定当前图像的重建图像,例如,将该目标预测图像和当前图像的前一个或几个重建图像输入一神经网络中,得到当前图像的重建图像。
在一些实施例中,为了进一步提高重建图像的确定准确性,则编码端根据当前图像和目标预测图像,确定当前图像的残差值;对残差值进行编码,得到残差码流。此时,则本申请实施例还包括残差解码,上述S404-C-B包括如下S404-C-B1和S404-C-B2的步骤:
S404-C-B1、对残差码流进行解码,得到当前图像的残差值;
S404-C-B2、根据目标预测图像和残差值,得到重建图像。
本申请实施例中,为了提高重建图像的效果,则编码端还通过残差编码的方式,生成残差码流,具体是,编码端确定当前图像的残差值,对该残差值进行编码生成残差码流。对应的,编码端对残差码流进行解码,得到当前图像的残差值,并根据目标预测图像和残差值,得到重建图像。
本申请实施例对上述当前图像的残差值的具体表示形式不做限制。
在一种可能的实现方式中,当前图像的残差值为一个矩阵,该矩阵中的每个元素为当前图像中每个像素点对应的残差值。这样,编码端可以逐像素地,将每个像素点在目标预测图像中对应的预测值与该像素点对应的残差值进行相加,得到每个像素点的重建值,进而得到当前图像的重建图像。以当前图像中的第i个像素点为例,在目标预测图像中,得到该第i个像素点对应的预测值,以及从当前图像的残差值中得到该第i个像素点对应的残差值,接着,将该第i个像素点对应的预测值和残差值进行相加,得到该第i个像素点对应的重建值。针对当前图像中的每个像素点,参照上述第i个像素点,可以得到当前图像中每个像素点对应的重建值,当前图像中每个像素点对应的重建值,组成当前图像的重建图像。
本申请实施例对编码端得到当前图像的残差值的具体方式不做限制,也就是说,本申请实施例对编解码两端所采用的残差编解码的方式不做限制。
在一种示例中,编码端确定出当前图像的目标预测图像,接着,根据当前图像和目标预测图像,得到当前图像的残差值,例如,将当前图像和目标预测图像的差值确定为当前图像的残差值。接着,对当前图像的残差值进行编码,生成残差码流。可选的,可以对当前图像的残差值进行变换,得到变换系数,对变换系数进行量化得到量化系数,对量化系数进行编码,得到残差码流。对应的,编码端解码残差码流,得到当前图像的残差值,例如解码残差码流,得到量化系数,对量化系数进行反量化和反变换,得到当前图像的残差值。接着,再根据上述方法,将目标预测图像和当前图像对应的残差值进行相加,得到当前图像的重建图像。
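为直观说明上述残差编解码回路,下面给出一个极简的示意性代码草图(这里仅以标量量化说明"量化—编码—解码—反量化"的往返过程,变换与熵编码环节省略,量化步长等参数为本示例的假设,并非本申请实际采用的残差编解码方式):

```python
import torch

def encode_residual(residual: torch.Tensor, q_step: float = 0.1) -> torch.Tensor:
    """示意:对残差做标量量化,得到待熵编码的量化系数(变换与熵编码环节省略)。"""
    return torch.round(residual / q_step)

def decode_residual(q_coeff: torch.Tensor, q_step: float = 0.1) -> torch.Tensor:
    """示意:反量化得到重建的残差值。"""
    return q_coeff * q_step

# 用法示意:编码端的重建回路
current = torch.rand(1, 3, 64, 64)       # 当前图像(随机数据,仅作演示)
target_pred = torch.rand(1, 3, 64, 64)   # 目标预测图像(示意)
residual = current - target_pred         # 当前图像的残差值
rec_residual = decode_residual(encode_residual(residual))
reconstruction = target_pred + rec_residual   # 当前图像的重建图像(示意)
```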
在一些实施例中,编码端可以采用神经网络的方法,对当前图像和当前图像的目标预测图像进行处理,生成当前图像的残差值,并对当前图像的残差值进行编码,生成残差码流。
本申请实施例中,编码端根据上述方法,可以得到当前图像的重建图像。
可选的,可以将该重建图像进行直接显示。
可选的,还可以将该重建图像存入缓存中,用于后续图像的编码。
本申请实施例提供的视频编码方法,编码端通过对当前图像以及当前图像的前一重建图像进行特征融合,得到第一特征信息;对第一特征信息进行量化,得到量化后的第一特征信息;对量化后的第一特征信息进行编码,得到第一码流,以使解码端解码第一码流,确定量化后的第一特征信息,对量化后的第一特征信息进行多级时域融合,得到混合时空表征;根据混合时空表征对所述前一重建图像进行运动补偿,得到当前图像的P个预测图像;进而根据P个预测图像,确定当前图像的重建图像。即本申请,为了提高重建图像的准确性,对量化后的第一特征信息进行多级时域融合,例如将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,这样在当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征对前一重建图像进行运动补偿时,可以生成高精度的P个预测图像,基于该高精度的P个预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
本申请实施例中,提出一种端到端的基于神经网络的编解码框架,该基于神经网络的编解码框架包括基于神经网络的编码器和基于神经网络的解码器。下面结合本申请一种可能的基于神经网络的编码器,对本申请实施例的编码过程进行介绍。
图12为本申请一实施例涉及的一种基于神经网络的编码器的网络结构示意图,包括:时空特征提取模块、反变换模块、递归聚合模块和混合运动补偿模块。
其中,时空特征提取模块用于对级联后的当前图像和前一重建图像进行特征提取和下采样,得到第一特征信息。
反变换模块用于对量化后的第二特征信息进行反变换,得到第一特征信息的重建特征信息,示例性的,其网络结构如图3所示。
递归聚合模块用于对量化后的第一特征信息进行多级时域融合,得到混合时空表征,示例性的,其网络结构如图4所示。
混合运动补偿模块用于对混合时空表征进行混合运动补偿,得到当前图像的目标预测图像,示例性的,混合运动补偿模块可以包括图5所示的第一解码器和/或图6所示的第二解码器,可选的,若混合运动补偿模块包括第一解码器和第二解码器,则该混合运动补偿模块还可以包括图7所示的第三解码器。在一些实施例中,该混合运动补偿模块还可以包括如图8所示的第四解码器。
示例性的,本申请实施例以运动补偿模块包括第一解码器、第二解码器、第三解码器和第四解码器为例进行说明。
在上述图12所示的基于神经网络的编码器的基础上,结合图13对本申请实施例一种可能的视频编码方法进行介绍。
图13为本申请一实施例提供的视频编码流程示意图,如图13所示,包括:
S501、对当前图像以及当前图像的前一重建图像进行特征融合,得到第一特征信息。
例如,编码端将当前图像X t和当前图像的前一重建图像进行通道间的级联得到X cat,接着,对级联后的图像X cat进行特征提取,得到第一特征信息。
上述S501的具体实现过程参照上述S401的描述,在此不再赘述。
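下面给出上述"通道级联+特征提取"步骤的一个示意性代码草图(其中卷积层数、通道数与下采样倍数均为本示例的假设,实际的时空特征提取模块以图12及上文描述为准):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """示意:将当前图像与前一重建图像做通道级联,再经卷积下采样得到第一特征信息。"""
    def __init__(self, img_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Conv2d(2 * img_ch, feat_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x_t: torch.Tensor, x_prev_rec: torch.Tensor) -> torch.Tensor:
        x_cat = torch.cat([x_t, x_prev_rec], dim=1)   # 通道间级联得到 X_cat
        return self.extract(x_cat)                    # 第一特征信息(示意)

# 用法示意
x_t = torch.rand(1, 3, 64, 64)     # 当前图像
x_prev = torch.rand(1, 3, 64, 64)  # 前一重建图像
feat1 = FeatureFusion()(x_t, x_prev)   # 形状约为 [1, 64, 16, 16]
```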
S502、对第一特征信息进行量化,得到量化后的第一特征信息。
上述S502的具体实现过程参照上述S402的描述,在此不再赘述。
S503、根据第一特征信息进行特征变换,得到第二特征信息。
上述S503的具体实现过程参照上述S403-A1的描述,在此不再赘述。
S504、对第二特征信息进行量化后再编码,得到第二码流。
上述S504的具体实现过程参照上述S403-A2的描述,在此不再赘述。
S505、对第二码流进行解码,得到量化后的第二特征信息。
上述S505的具体实现过程参照上述S403-A3的描述,在此不再赘述。
S506、通过反变换模块对量化后的第二特征信息进行反变换,得到重建特征信息。
示例性的,该反变换模块的具体网络结构如图3所示,包括2个非局部自注意力模块和2个上采样模块。
例如,编码端将量化后的第二特征信息输入反变换模块进行反变换,该反变换模块输出重建特征信息。
上述S506的具体实现过程参照上述S403-A31的描述,在此不再赘述。
S507、确定重建特征信息的概率分布。
S508、根据重建特征信息的概率分布,预测得到量化后的第一特征信息的概率分布。
S509、根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。
上述S507至S509的具体实现过程参照上述S403-A32、S403-A33和S403-A4的描述,在此不再赘述。
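下面给出"由重建特征信息预测量化后的第一特征信息的概率分布"的一个示意性代码草图(这里假设以高斯分布建模量化后的特征,由重建特征信息预测均值与尺度,再按量化区间积分得到各符号的概率;该分布形式与网络结构均为本示例的假设,仅用于说明熵编码所需概率的获取方式):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntropyParamNet(nn.Module):
    """示意:由重建特征信息预测量化后第一特征信息的高斯参数(均值、尺度)。"""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.net = nn.Conv2d(ch, 2 * ch, kernel_size=3, padding=1)

    def forward(self, rec_feat: torch.Tensor):
        mean, scale = self.net(rec_feat).chunk(2, dim=1)
        return mean, F.softplus(scale) + 1e-6   # 尺度须为正

def symbol_prob(y_hat: torch.Tensor, mean: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """示意:按量化区间 [y-0.5, y+0.5] 对高斯分布积分,得到每个量化符号的概率。"""
    dist = torch.distributions.Normal(mean, scale)
    return dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)

# 用法示意:得到的概率可进一步交给算术编码器/解码器使用(此处省略)
rec_feat = torch.rand(1, 64, 16, 16)                # 重建特征信息
y_hat = torch.round(torch.randn(1, 64, 16, 16))     # 量化后的第一特征信息(示意)
mean, scale = EntropyParamNet()(rec_feat)
p = symbol_prob(y_hat, mean, scale)                 # 各符号的概率(示意)
```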
本申请实施例还包括确定重建图像的过程。
S510、根据量化后的第一特征信息的概率分布,对第一码流进行解码,得到量化后的第一特征信息。
S511、通过递归聚合模块,对量化后的第一特征信息进行多级时域融合,得到混合时空表征。
可选的,递归聚合模块由至少一个时空递归网络堆叠而成。
示例性的,递归聚合模块的网络结构如图4所示。
例如,编码端将量化后的第一特征信息输入递归聚合模块,以使递归聚合模块将量化后的第一特征信息与前一时刻递归聚合模块的隐式特征信息进行融合,进而输出混合时空表征。上述S511的具体实现过程参照上述S404-A的描述,在此不再赘述。
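下面给出递归聚合模块的一个示意性代码草图(这里以简化的卷积门控递归单元说明"将量化后的第一特征信息与前一时刻的隐式特征信息进行融合"的过程,结构与参数均为本示例的假设,实际以图4及上文描述为准):

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """示意:卷积门控递归单元,将当前输入特征与上一时刻隐状态融合。"""
    def __init__(self, in_ch: int = 64, hid_ch: int = 64):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)  # 更新门/重置门
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)       # 候选隐状态

    def forward(self, x: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([x, h_prev], dim=1))).chunk(2, dim=1)
        h_cand = torch.tanh(self.cand(torch.cat([x, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_cand   # 融合得到新的隐式特征(混合时空表征的示意)

# 用法示意:逐帧递归,隐状态在时间维上聚合多帧信息
cell = ConvGRUCell()
h = torch.zeros(1, 64, 16, 16)             # 初始隐式特征信息
for _ in range(3):                         # 连续若干帧
    y_hat = torch.rand(1, 64, 16, 16)      # 量化后的第一特征信息(示意)
    h = cell(y_hat, h)                     # h 即作为混合时空表征的示意输出
```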
S512、通过第一解码器对混合时空表征进行处理,得到第一预测图像。
根据上述S511得到混合时空表征后,将该混合时空表征和前一重建图像输入混合运动补偿模块进行混合运动补偿,得到当前图像的目标预测图像。
具体是,通过第一解码器对混合时空表征进行处理,确定光流运动信息,并根据光流运动信息对前一重建图像进行运动补偿,得到第一预测图像。
可选的,第一解码器的网络结构如图5所示。
上述S512的具体实现过程,参照上述S404-B1和S404-B2的具体描述,在此不再赘述。
S513、通过第二解码器对混合时空表征进行处理,得到第二预测图像。
具体是,通过SFE对前一重建图像进行空间特征提取,得到参考特征信息;将参考特征信息和混合时空表征输入第二解码器,以使第二解码器基于混合时空表征生成偏移量,并使用偏移量对参考特征信息进行运动补偿,得到第二预测图像。
可选的,第二解码器的网络结构如图6所示。
上述S513的具体实现过程,参照上述S404-B-1至S404-B-3的具体描述,在此不再赘述。
S514、通过第三解码器对混合时空表征进行处理,得到第一预测图像对应的第一权重和第二预测图像对应的第二权重。
具体是,将混合时空表征输入第三解码器进行自适应掩膜,得到第一预测图像对应的第一权重和第二预测图像对应的第二权重。
可选的,第三解码器的网络结构如图7所示。
上述S514的具体实现过程,参照上述S404-C-A11中方式二的具体描述,在此不再赘述。
S515、根据第一权重和第二权重,对第一预测图像和第二预测图像进行加权,得到加权图像。
例如,将第一权重与第一预测图像的乘积,与第二权重与第二预测图像的乘积相加,得到加权图像。
S516、通过第四解码器对混合时空表征进行处理,得到当前图像的残差图像。
具体是,将混合时空表征输入第四解码器进行处理,得到当前图像的残差图像。
可选的,第四解码器的网络结构如图8所示。
上述S516的具体实现过程,参照上述S404-C-A12的具体描述,在此不再赘述。
S517、根据加权图像和残差图像,确定目标预测图像。
例如,将加权图像和残差图像相加,确定为目标预测图像。
S518、对残差码流进行解码,得到当前图像的残差值。
S519、根据目标预测图像和残差值,得到重建图像。
上述S518和S519的具体实现过程,参照上述S404-C-B1和S404-C-B2的具体描述,在此不再赘述。
本申请实施例,通过图12所示的基于神经网络的编码器进行编码时,对量化后的第一特征信息进行多级时域融合,即将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样,基于该混合时空表征对前一重建图像进行运动补偿,生成多个解码信息,例如该多个解码信息包括第一预测图像、第二预测图像、第一预测图像和第二预测图像分别对应的权重、以及残差图像,这样基于这多个解码信息确定当前图像的目标预测图像时,可以有效提高目标预测图像的准确性,进而基于该准确的预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。
应理解,图2至图13仅为本申请的示例,不应理解为对本申请的限制。
以上结合附图详细描述了本申请的优选实施方式,但是,本申请并不限于上述实施方式中的具体细节,在本申请的技术构思范围内,可以对本申请的技术方案进行多种简单变型,这些简单变型均属于本申请的保护范围。例如,在上述具体实施方式中所描述的各个具体技术特征,在不矛盾的情况下,可以通过任何合适的方式进行组合,为了避免不必要的重复,本申请对各种可能的组合方式不再另行说明。又例如,本申请的各种不同的实施方式之间也可以进行任意组合,只要其不违背本申请的思想,其同样应当视为本申请所公开的内容。
还应理解,在本申请的各种方法实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。另外,本申请实施例中,术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系。具体地,A和/或B可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
上文结合图2至图13,详细描述了本申请的方法实施例,下文结合图14至图17,详细描述本申请的装置实施例。
图14是本申请实施例提供的视频解码装置的示意性框图。
如图14所示,视频解码装置10包括:
解码单元11,用于解码第一码流,确定量化后的第一特征信息,所述第一特征信息是对当前图像和所述当前图像的前一重建图像进行特征融合得到的;
融合单元12,用于对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;
补偿单元13,用于根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;
重建单元14,用于根据所述P个预测图像,确定所述当前图像的重建图像。
在一些实施例中,融合单元12,具体用于通过递归聚合模块将量化后的所述第一特征信息,与前一时刻所述递归聚合模块的隐式特征信息进行融合,得到所述混合时空表征。
可选的,所述递归聚合模块由至少一个时空递归网络堆叠而成。
在一些实施例中,所述P个预测图像包括第一预测图像,补偿单元13,具体用于根据所述混合时空表征,确定光流运动信息;根据所述光流运动信息对所述前一重建图像进行运动补偿,得到所述第一预测图像。
在一些实施例中,所述P个预测图像包括第二预测图像,补偿单元13,具体用于根据所述混合时空表征,得到所述当前图像对应的偏移量;对所述前一重建图像进行空间特征提取,得到参考特征信息;使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像。
在一些实施例中,补偿单元13,具体用于使用所述偏移量,对所述参考特征信息进行基于可变形卷积的运动补偿,得到所述第二预测图像。
在一些实施例中,重建单元14,用于根据所述P个预测图像,确定所述当前图像的目标预测图像;根据所述目标预测图像,确定所述当前图像的重建图像。
在一些实施例中,重建单元14,用于根据所述P个预测图像,确定加权图像;根据所述加权图像,得到所述目标预测图像。
在一些实施例中,重建单元14,还用于根据所述混合时空表征,得到所述当前图像的残差图像;根据所述P个预测图像和所述残差图像,得到所述目标预测图像。
在一些实施例中,重建单元14,具体用于根据所述P个预测图像,确定加权图像;根据所述加权图像和所述残差图像,确定所述目标预测图像。
在一些实施例中,重建单元14,具体用于确定所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像。
在一些实施例中,重建单元14,具体用于根据所述混合时空表征进行自适应掩膜,得到所述P个预测图像对应的权重。
在一些实施例中,若所述P个预测图像包括第一预测图像和第二预测图像,重建单元14,具体用于确定所述第一预测图像对应的第一权重和所述第二预测图像对应的第二权重;根据所述第一权重和所述第二权重,对所述第一预测图像和所述第二预测图像进行加权,得到所述加权图像。
在一些实施例中,重建单元14,具体用于对残差码流进行解码,得到所述当前图像的残差值;根据所述目标预测图像和所述残差值,得到所述重建图像。
在一些实施例中,解码单元11,具体用于解码第二码流,得到量化后的第二特征信息,所述第二特征信息是对所述第一特征信息进行特征变换得到的;根据量化后的所述第二特征信息,确定量化后的所述第一特征信息的概率分布;根据量化后的所述第一特征信息的概率分布,对所述第一码流进行解码,得到量化后的所述第一特征信息。
在一些实施例中,解码单元11,具体用于对量化后的所述第二特征信息进行反变换,得到重建特征信息;确定所述重建特征信息的概率分布;根据所述重建特征信息的概率分布,预测得到量化后的所述第一特征信息的概率分布。
在一些实施例中,解码单元11,具体用于对量化后的所述第二特征信息进行N次非局部注意力变换和N次上采样,得到所述重建特征信息,所述N为正整数。
在一些实施例中,解码单元11,具体用于根据所述重建特征信息的概率分布,预测量化后的所述第一特征信息中编码像素的概率;根据量化后的所述第一特征信息中编码像素的概率,得到量化后的所述第一特征信息的概率分布。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图14所示的视频解码装置10可以对应于执行本申请实施例的方法中的相应主体,并且视频解码装置10中的各个单元的前述和其它操作和/或功能分别为了实现方法等各个方法中的相应流程,为了简洁,在此不再赘述。
图15是本申请实施例提供的视频编码装置的示意性框图。
如图15所示,视频编码装置20包括:
融合单元21,用于对当前图像以及所述当前图像的前一重建图像进行特征融合,得到第一特征信息;
量化单元22,用于对所述第一特征信息进行量化,得到量化后的所述第一特征信息;
编码单元23,用于对量化后的所述第一特征信息进行编码,得到所述第一码流。
在一些实施例中,融合单元21,具体用于将所述当前图像和所述重建图像进行通道级联,得到级联后的图像;对所述级联后的图像进行特征提取,得到所述第一特征信息。
在一些实施例中,融合单元21,具体用于对所述级联后的图像进行Q次非局部注意力变换和Q次下采样,得到所述第一特征信息,所述Q为正整数。
在一些实施例中,编码单元23,还用于根据所述第一特征信息进行特征变换,得到第二特征信息;对所述第二特征信息进行量化后再编码,得到第二码流;对所述第二码流进行解码,得到量化后的所述第二特征信息,并根据量化后的所述第二特征信息,确定量化后的所述第一特征信息的概率分布;根据量化后的所述第一特征信息的概率分布,对量化后的所述第一特征信息进行编码,得到第一码流。
在一些实施例中,编码单元23,具体用于对所述第一特征信息进行N次非局部注意力变换和N次下采样,得到所述第二特征信息,所述N为正整数。
在一些实施例中,编码单元23,具体用于对量化后的所述第一特征信息进行N次非局部注意力变换和N次下采样,得到所述第二特征信息。
在一些实施例中,编码单元23,还用于对所述第二特征信息进行量化,得到量化后的所述第二特征信息;确定量化后的所述第二特征信息的概率分布;根据量化后的所述第二特征信息的概率分布,对量化后的所述第二特征信息进行编码,得到所述第二码流。
在一些实施例中,编码单元23,具体用于对量化后的所述第二特征信息进行反变换,得到重建特征信息;确定所述重建特征信息的概率分布;根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息的概率分布。
在一些实施例中,编码单元23,具体用于对量化后的所述第二特征信息进行N次非局部注意力变换和N次上采样,得到所述重建特征信息,所述N为正整数。
在一些实施例中,编码单元23,具体用于根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息中编码像素的概率;根据量化后的所述第一特征信息中编码像素的概率,得到量化后的所述第一特征信息的概率分布。
在一些实施例中,编码单元23,还用于确定所述当前图像的重建图像。
在一些实施例中,编码单元23,具体用于对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;根据所述P个预测图像,确定所述当前图像的重建图像。
在一些实施例中,编码单元23,具体用于通过递归聚合模块将量化后的所述第一特征信息,与前一时刻所述递归聚合模块的隐式特征信息进行融合,得到所述混合时空表征。
可选的,所述递归聚合模块由至少一个时空递归网络堆叠而成。
在一些实施例中,所述P个预测图像包括第一预测图像,编码单元23,具体用于根据所述混合时空表征,确定光流运动信息;根据所述光流运动信息对所述前一重建图像进行运动补偿,得到所述第一预测图像。
在一些实施例中,所述P个预测图像包括第二预测图像,编码单元23,具体用于根据所述混合时空表征,得到所述当前图像对应的偏移量;对所述前一重建图像进行空间特征提取,得到参考特征信息;使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像。
在一些实施例中,编码单元23,具体用于使用所述偏移量,对所述参考特征信息进行基于可变形卷积的运动补偿,得到所述第二预测图像。
在一些实施例中,编码单元23,具体用于根据所述P个预测图像,确定所述当前图像的目标预测图像;根据所述目标预测图像,确定所述当前图像的重建图像。
在一些实施例中,编码单元23,具体用于根据所述P个预测图像,确定加权图像;根据所述加权图像,得到所述目标预测图像。
在一些实施例中,编码单元23,还用于根据所述混合时空表征,得到所述当前图像的残差图像;根据所述P个预测图像和所述残差图像,得到所述目标预测图像。
在一些实施例中,若所述P大于1,编码单元23,具体用于根据所述P个预测图像,确定加权图像;根据所述加权图像和所述残差图像,确定所述目标预测图像。
在一些实施例中,编码单元23,具体用于确定所述P个预测图像对应的权重;根据所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像。
在一些实施例中,编码单元23,具体用于根据所述混合时空表征进行自适应掩膜,得到所述P个预测图像对应的权重。
在一些实施例中,若所述P个预测图像包括第一预测图像和第二预测图像,编码单元23,具体用于确定所述P个预测图像,确定所述第一预测图像对应的第一权重和所述第二预测图像对应的第二权重;根据所述第一权重和所述第二权重,对所述第一预测图像和所述第二预测图像进行加权,得到所述加权图像。
在一些实施例中,编码单元23,还用于根据所述当前图像和所述目标预测图像,确定所述当前图像的残差值;对所述残差值进行编码,得到残差码流。
在一些实施例中,编码单元23,具体用于对所述残差码流进行解码,得到所述当前图像的残差值;根据所述目标预测图像和所述残差值,得到所述重建图像。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图15所示的视频编码装置20可以对应于执行本申请实施例的方法中的相应主体,并且视频编码装置20中的各个单元的前述和其它操作和/或功能分别为了实现方法等各个方法中的相应流程,为了简洁,在此不再赘述。
上文中结合附图从功能单元的角度描述了本申请实施例的装置和系统。应理解,该功能单元可以通过硬件形式实现,也可以通过软件形式的指令实现,还可以通过硬件和软件单元组合实现。具体地,本申请实施例中的方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路和/或软件形式的指令完成,结合本申请实施例公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件单元组合执行完成。可选地,软件单元可以位于随机存储器,闪存、只读存储器、可编程只读存储器、电可擦写可编程存储器、寄存器等本领域的成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法实施例中的步骤。
图16是本申请实施例提供的电子设备的示意性框图。
如图16所示,该电子设备30可以为本申请实施例所述的视频编码器,或者视频解码器,该电子设备30可包括:
存储器33和处理器32,该存储器33用于存储计算机程序34,并将该程序代码34传输给该处理器32。换言之,该处理器32可以从存储器33中调用并运行计算机程序34,以实现本申请实施例中的方法。
例如,该处理器32可用于根据该计算机程序34中的指令执行上述方法中的步骤。
在本申请的一些实施例中,该处理器32可以包括但不限于:
通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等等。
在本申请的一些实施例中,该存储器33包括但不限于:
易失性存储器和/或非易失性存储器。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
在本申请的一些实施例中,该计算机程序34可以被分割成一个或多个单元,该一个或者多个单元被存储在该存储器33中,并由该处理器32执行,以完成本申请提供的方法。该一个或多个单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述该计算机程序34在该电子设备30中的执行过程。
如图16所示,该电子设备30还可包括:
收发器33,该收发器33可连接至该处理器32或存储器33。
其中,处理器32可以控制该收发器33与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。收发器33可以包括发射机和接收机。收发器33还可以进一步包括天线,天线的数量可以为一个或多个。
应当理解,该电子设备30中的各个组件通过总线系统相连,其中,总线系统除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。
图17是本申请实施例提供的视频编解码系统40的示意性框图。
如图17所示,该视频编解码系统40可包括:视频编码器41和视频解码器42,其中视频编码器41用于执行本申请实施例涉及的视频编码方法,视频解码器42用于执行本申请实施例涉及的视频解码方法。
在一些实施例中,本申请还提供一种码流,该码流通过上述编码方法得到。
本申请还提供了一种计算机存储介质,其上存储有计算机程序,该计算机程序被计算机执行时使得该计算机能够执行上述方法实施例的方法。或者说,本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述方法实施例的方法。
当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例该的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,该计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如数字视频光盘(digital video disc,DVD))、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。例如,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以该权利要求的保护范围为准。

Claims (50)

  1. 一种视频解码方法,其特征在于,包括:
    解码第一码流,确定量化后的第一特征信息,所述第一特征信息是对当前图像和所述当前图像的前一重建图像进行特征融合得到的;
    对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;
    根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;
    根据所述P个预测图像,确定所述当前图像的重建图像。
  2. 根据权利要求1所述的方法,其特征在于,所述对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征,包括:
    通过递归聚合模块将量化后的所述第一特征信息,与前一时刻所述递归聚合模块的隐式特征信息进行融合,得到所述混合时空表征。
  3. 根据权利要求2所述的方法,其特征在于,所述递归聚合模块由至少一个时空递归网络堆叠而成。
  4. 根据权利要求1所述的方法,其特征在于,所述P个预测图像包括第一预测图像,所述根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,包括:
    根据所述混合时空表征,确定光流运动信息;
    根据所述光流运动信息对所述前一重建图像进行运动补偿,得到所述第一预测图像。
  5. 根据权利要求1所述的方法,其特征在于,所述P个预测图像包括第二预测图像,所述根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,包括:
    根据所述混合时空表征,得到所述当前图像对应的偏移量;
    对所述前一重建图像进行空间特征提取,得到参考特征信息;
    使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像。
  6. 根据权利要求5所述的方法,其特征在于,所述使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像,包括:
    使用所述偏移量,对所述参考特征信息进行基于可变形卷积的运动补偿,得到所述第二预测图像。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述根据所述P个预测图像,确定所述当前图像的重建图像,包括:
    根据所述P个预测图像,确定所述当前图像的目标预测图像;
    根据所述目标预测图像,确定所述当前图像的重建图像。
  8. 根据权利要求7所述的方法,其特征在于,若所述P大于1时,所述根据所述P个预测图像,确定所述当前图像的目标预测图像,包括:
    根据所述P个预测图像,确定加权图像;
    根据所述加权图像,得到所述目标预测图像。
  9. 根据权利要求7所述的方法,其特征在于,所述方法还包括:
    根据所述混合时空表征,得到所述当前图像的残差图像;
    所述根据所述P个预测图像,确定所述当前图像的目标预测图像,包括:
    根据所述P个预测图像和所述残差图像,得到所述目标预测图像。
  10. 根据权利要求9所述的方法,其特征在于,若所述P大于1,所述根据所述P个预测图像和所述残差图像,得到所述目标预测图像,包括:
    根据所述P个预测图像,确定加权图像;
    根据所述加权图像和所述残差图像,确定所述目标预测图像。
  11. 根据权利要求8或10所述的方法,其特征在于,所述根据所述P个预测图像,确定加权图像,包括:
    确定所述P个预测图像对应的权重;
    根据所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像。
  12. 根据权利要求11所述的方法,其特征在于,所述确定所述P个预测图像对应的权重,包括:
    根据所述混合时空表征进行自适应掩膜,得到所述P个预测图像对应的权重。
  13. 根据权利要求11所述的方法,其特征在于,若所述P个预测图像包括第一预测图像和第二预测图像,所述确定所述P个预测图像对应的权重,包括:
    确定所述第一预测图像对应的第一权重和所述第二预测图像对应的第二权重;
    所述根据所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像,包括:
    根据所述第一权重和所述第二权重,对所述第一预测图像和所述第二预测图像进行加权,得到所述加权图像。
  14. 根据权利要求7所述的方法,其特征在于,所述根据所述目标预测图像,确定所述当前图像的重建图像,包括:
    对残差码流进行解码,得到所述当前图像的残差值;
    根据所述目标预测图像和所述残差值,得到所述重建图像。
  15. 根据权利要求1-6任一项所述的方法,其特征在于,所述解码第一码流,确定量化后的第一特征信息,包括:
    解码第二码流,得到量化后的第二特征信息,所述第二特征信息是对所述第一特征信息进行特征变换得到的;
    根据量化后的所述第二特征信息,确定量化后的所述第一特征信息的概率分布;
    根据量化后的所述第一特征信息的概率分布,对所述第一码流进行解码,得到量化后的所述第一特征信息。
  16. 根据权利要求15所述的方法,其特征在于,所述根据量化后的所述第二特征信息,确定量化后的第一特征信息的概率分布信息,包括:
    对量化后的所述第二特征信息进行反变换,得到重建特征信息;
    确定所述重建特征信息的概率分布;
    根据所述重建特征信息的概率分布,预测得到量化后的所述第一特征信息的概率分布。
  17. 根据权利要求16所述的方法,其特征在于,所述对量化后的所述第二特征信息进行反变换,得到重建特征信息,包括:
    对量化后的所述第二特征信息进行N次非局部注意力变换和N次上采样,得到所述重建特征信息,所述N为正整数。
  18. 根据权利要求16所述的方法,其特征在于,所述根据所述重建特征信息的概率分布,预测得到量化后的所述第一特征信息的概率分布,包括:
    根据所述重建特征信息的概率分布,预测量化后的所述第一特征信息中编码像素的概率;
    根据量化后的所述第一特征信息中编码像素的概率,得到量化后的所述第一特征信息的概率分布。
  19. 一种视频编码方法,其特征在于,包括:
    对当前图像以及所述当前图像的前一重建图像进行特征融合,得到第一特征信息;
    对所述第一特征信息进行量化,得到量化后的所述第一特征信息;
    对量化后的所述第一特征信息进行编码,得到第一码流。
  20. 根据权利要求19所述的方法,其特征在于,所述对当前图像以及所述当前图像之前的重建图像进行特征融合,得到第一特征信息,包括:
    将所述当前图像和所述重建图像进行通道级联,得到级联后的图像;
    对所述级联后的图像进行特征提取,得到所述第一特征信息。
  21. 根据权利要求20所述的方法,其特征在于,所述对所述级联后的图像进行特征提取,得到所述第一特征信息,包括:
    对所述级联后的图像进行Q次非局部注意力变换和Q次下采样,得到所述第一特征信息,所述Q为正整数。
  22. 根据权利要求19所述的方法,其特征在于,所述对量化后的所述第一特征信息进行编码,得到所述第一码流,包括:
    根据所述第一特征信息进行特征变换,得到第二特征信息;
    对所述第二特征信息进行量化后再编码,得到第二码流;
    对所述第二码流进行解码,得到量化后的所述第二特征信息,并根据量化后的所述第二特征信息,确定量化后的所述第一特征信息的概率分布;
    根据量化后的所述第一特征信息的概率分布,对量化后的所述第一特征信息进行编码,得到第一码流。
  23. 根据权利要求22所述的方法,其特征在于,所述根据所述第一特征信息进行特征变换,得到第二特征信息,包括:
    对所述第一特征信息进行N次非局部注意力变换和N次下采样,得到所述第二特征信息,所述N为正整数。
  24. 根据权利要求22所述的方法,其特征在于,所述根据所述第一特征信息进行特征变换,得到第二特征信息,包括:
    对量化后的所述第一特征信息进行N次非局部注意力变换和N次下采样,得到所述第二特征信息。
  25. 根据权利要求22所述的方法,其特征在于,所述对所述第二特征信息进行量化后再编码,得到第二码流,包括:
    对所述第二特征信息进行量化,得到量化后的所述第二特征信息;
    确定量化后的所述第二特征信息的概率分布;
    根据量化后的所述第二特征信息的概率分布,对量化后的所述第二特征信息进行编码,得到所述第二码流。
  26. 根据权利要求22所述的方法,其特征在于,所述根据量化后的所述第二特征信息,确定量化后的第一特征信息的概率分布信息,包括:
    对量化后的所述第二特征信息进行反变换,得到重建特征信息;
    确定所述重建特征信息的概率分布;
    根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息的概率分布。
  27. 根据权利要求26所述的方法,其特征在于,所述对量化后的所述第二特征信息进行反变换,得到重建特征信息,包括:
    对量化后的所述第二特征信息进行N次非局部注意力变换和N次上采样,得到所述重建特征信息,所述N为正整数。
  28. 根据权利要求26所述的方法,其特征在于,所述根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息的概率分布,包括:
    根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息中编码像素的概率;
    根据量化后的所述第一特征信息中编码像素的概率,得到量化后的所述第一特征信息的概率分布。
  29. 根据权利要求19-28任一项所述的方法,其特征在于,所述方法还包括:
    确定所述当前图像的重建图像。
  30. 根据权利要求29所述的方法,其特征在于,所述确定所述当前图像的重建图像,包括:
    对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;
    根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;
    根据所述P个预测图像,确定所述当前图像的重建图像。
  31. 根据权利要求30所述的方法,其特征在于,所述对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征,包括:
    通过递归聚合模块将量化后的所述第一特征信息,与前一时刻所述递归聚合模块的隐式特征信息进行融合,得到所述混合时空表征。
  32. 根据权利要求31所述的方法,其特征在于,所述递归聚合模块由至少一个时空递归网络堆叠而成。
  33. 根据权利要求30所述的方法,其特征在于,所述P个预测图像包括第一预测图像,所述根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,包括:
    根据所述混合时空表征,确定光流运动信息;
    根据所述光流运动信息对所述前一重建图像进行运动补偿,得到所述第一预测图像。
  34. 根据权利要求30所述的方法,其特征在于,所述P个预测图像包括第二预测图像,所述根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,包括:
    根据所述混合时空表征,得到所述当前图像对应的偏移量;
    对所述前一重建图像进行空间特征提取,得到参考特征信息;
    使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像。
  35. 根据权利要求34所述的方法,其特征在于,所述使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像,包括:
    使用所述偏移量,对所述参考特征信息进行基于可变形卷积的运动补偿,得到所述第二预测图像。
  36. 根据权利要求30-35任一项所述的方法,其特征在于,所述根据所述P个预测图像,确定所述当前图像的重建图像,包括:
    根据所述P个预测图像,确定所述当前图像的目标预测图像;
    根据所述目标预测图像,确定所述当前图像的重建图像。
  37. 根据权利要求36所述的方法,其特征在于,若所述P大于1时,所述根据所述P个预测图像,确定所述当前图像的目标预测图像,包括:
    根据所述P个预测图像,确定加权图像;
    根据所述加权图像,得到所述目标预测图像。
  38. 根据权利要求36所述的方法,其特征在于,所述方法还包括:
    根据所述混合时空表征,得到所述当前图像的残差图像;
    所述根据所述P个预测图像,确定所述当前图像的目标预测图像,包括:
    根据所述P个预测图像和所述残差图像,得到所述目标预测图像。
  39. 根据权利要求38所述的方法,其特征在于,若所述P大于1,所述根据所述P个预测图像和所述残差图像,得到所述目标预测图像,包括:
    根据所述P个预测图像,确定加权图像;
    根据所述加权图像和所述残差图像,确定所述目标预测图像。
  40. 根据权利要求37或39所述的方法,其特征在于,所述根据所述P个预测图像,确定加权图像,包括:
    确定所述P个预测图像对应的权重;
    根据所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像。
  41. 根据权利要求40所述的方法,其特征在于,所述确定所述P个预测图像对应的权重,包括:
    根据所述混合时空表征进行自适应掩膜,得到所述P个预测图像对应的权重。
  42. 根据权利要求41所述的方法,其特征在于,若所述P个预测图像包括第一预测图像和第二预测图像,所述确定所述P个预测图像对应的权重,包括:
    确定所述P个预测图像,确定所述第一预测图像对应的第一权重和所述第二预测图像对应的第二权重;
    所述根据所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像,包括:
    根据所述第一权重和所述第二权重,对所述第一预测图像和所述第二预测图像进行加权,得到所述加权图像。
  43. 根据权利要求36所述的方法,其特征在于,所述方法还包括:
    根据所述当前图像和所述目标预测图像,确定所述当前图像的残差值;
    对所述残差值进行编码,得到残差码流。
  44. 根据权利要求43所述的方法,其特征在于,所述根据所述目标预测图像,确定所述当前图像的重建图像,包括:
    对所述残差码流进行解码,得到所述当前图像的残差值;
    根据所述目标预测图像和所述残差值,得到所述重建图像。
  45. 一种视频解码装置,其特征在于,包括:
    解码单元,用于解码第一码流,确定量化后的第一特征信息,所述第一特征信息是对当前图像和所述当前图像的前一重建图像进行特征融合得到的;
    融合单元,用于对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;
    补偿单元,用于根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;
    重建单元,用于根据所述P个预测图像,确定所述当前图像的重建图像。
  46. 一种视频编码装置,其特征在于,包括:
    融合单元,用于对当前图像以及所述当前图像的前一重建图像进行特征融合,得到第一特征信息;
    量化单元,用于对所述第一特征信息进行量化,得到量化后的所述第一特征信息;
    编码单元,用于对量化后的所述第一特征信息进行编码,得到第一码流。
  47. 一种视频编解码系统,其特征在于,包括视频编码器和视频解码器;
    所述视频解码器用于执行如权利要求1-18任一项所述的视频解码方法;
    所述视频编码器用于执行如权利要求19-44任一项所述的视频编码方法。
  48. 一种电子设备,其特征在于,包括:存储器,处理器;
    所述存储器,用于存储计算机程序;
    所述处理器,用于执行所述计算机程序以实现如上述权利要求1至18或19至44任一项所述方法。
  49. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机执行指令,所述计算机执行指令被处理器执行时用于实现如权利要求1至18或19至44任一项所述的方法。
  50. 一种码流,其特征在于,包括如权利要求19至44任一项所述的方法得到的码流。
PCT/CN2022/090468 2022-04-29 2022-04-29 视频编解码方法、装置、设备、系统及存储介质 WO2023206420A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/090468 WO2023206420A1 (zh) 2022-04-29 2022-04-29 视频编解码方法、装置、设备、系统及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/090468 WO2023206420A1 (zh) 2022-04-29 2022-04-29 视频编解码方法、装置、设备、系统及存储介质

Publications (1)

Publication Number Publication Date
WO2023206420A1 true WO2023206420A1 (zh) 2023-11-02

Family

ID=88517008

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090468 WO2023206420A1 (zh) 2022-04-29 2022-04-29 视频编解码方法、装置、设备、系统及存储介质

Country Status (1)

Country Link
WO (1) WO2023206420A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111263161A (zh) * 2020-01-07 2020-06-09 北京地平线机器人技术研发有限公司 视频压缩处理方法、装置、存储介质和电子设备
US20210044811A1 (en) * 2018-04-27 2021-02-11 Panasonic Intellectual Property Corporation Of America Encoder, decoder, encoding method, and decoding method
CN112767534A (zh) * 2020-12-31 2021-05-07 北京达佳互联信息技术有限公司 视频图像处理方法、装置、电子设备及存储介质
CN113068041A (zh) * 2021-03-12 2021-07-02 天津大学 一种智能仿射运动补偿编码方法
CN113269133A (zh) * 2021-06-16 2021-08-17 大连理工大学 一种基于深度学习的无人机视角视频语义分割方法
CN113298894A (zh) * 2021-05-19 2021-08-24 北京航空航天大学 一种基于深度学习特征空间的视频压缩方法
CN114049258A (zh) * 2021-11-15 2022-02-15 Oppo广东移动通信有限公司 一种用于图像处理的方法、芯片、装置及电子设备

Similar Documents

Publication Publication Date Title
CN109218727B (zh) 视频处理的方法和装置
TW202247650A (zh) 使用機器學習系統進行隱式圖像和視訊壓縮
US11677987B2 (en) Joint termination of bidirectional data blocks for parallel coding
WO2022155974A1 (zh) 视频编解码以及模型训练方法与装置
WO2023279961A1 (zh) 视频图像的编解码方法及装置
WO2022253249A1 (zh) 特征数据编解码方法和装置
WO2023039859A1 (zh) 视频编解码方法、设备、系统、及存储介质
WO2022266955A1 (zh) 图像解码及处理方法、装置及设备
TW202239209A (zh) 用於經學習視頻壓縮的多尺度光流
US20240007637A1 (en) Video picture encoding and decoding method and related device
WO2023193629A1 (zh) 区域增强层的编解码方法和装置
CN116508320A (zh) 基于机器学习的图像译码中的色度子采样格式处理方法
WO2023098688A1 (zh) 图像编解码方法和装置
WO2023206420A1 (zh) 视频编解码方法、装置、设备、系统及存储介质
WO2022179509A1 (zh) 音视频或图像分层压缩方法和装置
WO2023225808A1 (en) Learned image compress ion and decompression using long and short attention module
WO2023184088A1 (zh) 图像处理方法、装置、设备、系统、及存储介质
WO2023220969A1 (zh) 视频编解码方法、装置、设备、系统及存储介质
WO2023000182A1 (zh) 图像编解码及处理方法、装置及设备
WO2023050433A1 (zh) 视频编解码方法、编码器、解码器及存储介质
TWI834087B (zh) 用於從位元流重建圖像及用於將圖像編碼到位元流中的方法及裝置、電腦程式產品
WO2023165487A1 (zh) 特征域光流确定方法及相关设备
US20240020884A1 (en) Online meta learning for meta-controlled sr in image and video compression
WO2023221599A1 (zh) 图像滤波方法、装置及设备
WO2024073213A1 (en) Diffusion-based data compression

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939268

Country of ref document: EP

Kind code of ref document: A1