WO2023185305A1 - Encoding method, apparatus, storage medium and computer program product - Google Patents

Encoding method, apparatus, storage medium and computer program product

Info

Publication number
WO2023185305A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
feature
motion
inter
residual
Prior art date
Application number
PCT/CN2023/076925
Other languages
English (en)
French (fr)
Inventor
师一博
王晶
葛运英
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023185305A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Definitions

  • the present application relates to the field of data compression technology, and in particular to an encoding method, device, storage medium and computer program product.
  • Video encoding can reduce the pressure on network bandwidth occupied by video storage and video transmission.
  • Video coding is also called video compression.
  • the essence of video coding is to remove redundant information in the video in order to achieve the purpose of using less data (video code stream) to represent the original video.
  • Video coding includes intra-frame prediction coding and inter-frame prediction coding. Intra-frame prediction coding does not require the use of reference frames. Inter-frame prediction coding needs to use the current frame and reference frames to determine inter-frame motion information, and use inter-frame motion information to compress video. The key to video coding is how to utilize inter-frame motion information more effectively. Therefore, current related research on video coding is increasingly focusing on inter-frame prediction coding. However, the prediction accuracy of inter-frame motion information in some current inter-frame predictive coding schemes is low.
  • Embodiments of the present application provide a coding method, device, storage medium and computer program product, which can improve the prediction accuracy of inter-frame motion information, thereby improving compression performance.
  • the technical solutions are as follows:
  • the first aspect provides an encoding method, which includes: determining a current feature and a reference feature, where the current feature is the feature of the current image to be encoded and the reference feature is the feature of the reference image of the current image; determining a correlation matrix of the reference feature relative to the current feature; determining inter-frame motion features based on the correlation matrix; and encoding the inter-frame motion features into the code stream.
  • the correlation matrix can represent which parts of the current feature and the reference feature are strongly correlated and which parts are weakly correlated, and the inter-frame motion information corresponding to the strongly correlated parts matters more for prediction. Therefore, in the process of fitting inter-frame motion, the magnitude of each element in the correlation matrix allows the inter-frame motion corresponding to the strongly correlated parts to be fitted better, while less attention is paid to the inter-frame motion corresponding to the weakly correlated parts.
  • the correlation matrix has an information enhancement effect on the prediction of inter-frame motion features, which can improve the prediction accuracy of inter-frame motion features, thereby improving compression performance.
  • determining inter-frame motion features based on the correlation matrix includes: inputting the correlation matrix into a motion coding network to obtain the inter-frame motion features; or inputting the correlation matrix, the current features and the reference features into the motion coding network to obtain the inter-frame motion features; or inputting the correlation matrix, the current image and the reference image into the motion coding network to obtain the inter-frame motion features.
  • the encoding end can directly input the correlation matrix into the motion coding network to obtain the inter-frame motion features. It can also combine the reference features and current features in feature space with the correlation matrix to obtain the inter-frame motion features, or combine the reference image and the current image in image space with the correlation matrix to obtain the inter-frame motion features.
  • determining inter-frame motion features includes: using the reference features as prediction features, and inputting the correlation matrix, the prediction features and the current features into the motion coding network to obtain motion features; determining the number of iterations; if the number of iterations is less than the iteration number threshold, inputting the motion features into the motion decoding network to obtain reconstructed motion features, transforming the reference features based on the reconstructed motion features to re-determine the prediction features, re-determining the correlation matrix of the prediction features relative to the current features, and returning to the step of inputting the correlation matrix, the prediction features and the current features into the motion coding network to obtain the motion features; if the number of iterations is equal to the iteration number threshold, determining the motion features as the inter-frame motion features.
  • the encoding end improves the prediction accuracy of the inter-frame motion features through multiple iterations; in other words, the iterative updates of the motion features further enrich their details.
  • the method further includes: determining residual features based on the inter-frame motion features; and encoding the residual features into the code stream.
  • in addition to determining and encoding the inter-frame motion features, the encoding end also determines and encodes residual features, so that the decoding end can decompress the video based on the inter-frame motion features and the residual features.
  • determining residual features based on the inter-frame motion features includes: inputting the inter-frame motion features into a motion decoding network to obtain reconstructed motion features between the current image and the reference image; transforming the reference features based on the reconstructed motion features between the current image and the reference image to obtain prediction features of the current image; determining a first residual, which is the residual between the prediction features of the current image and the current features; and inputting the first residual into a residual coding network to obtain the residual features.
  • the encoding side performs transformation and prediction in feature space.
  • determining residual features based on the inter-frame motion features includes: inputting the inter-frame motion features into a motion decoding network to obtain reconstructed motion features between the current image and the reference image; transforming the reference image based on the reconstructed motion features between the current image and the reference image to obtain a predicted image; determining a second residual, which is the residual between the predicted image and the current image; and inputting the second residual into the residual coding network to obtain the residual features.
  • the encoding side performs transformation and prediction in image space.
  • the reference image is a reconstructed image of the reference frame.
  • a second aspect provides an encoding device, which has the function of implementing the behavior of the encoding method in the first aspect.
  • the encoding device includes one or more modules, and the one or more modules are used to implement the encoding method provided in the first aspect.
  • the encoding device includes:
  • the first determination module is used to determine the current feature and the reference feature.
  • the current feature is the feature of the current image to be encoded, and the reference feature is the feature of the reference image of the current image;
  • the second determination module is used to determine the correlation matrix of the reference feature relative to the current feature
  • the third determination module is used to determine inter-frame motion characteristics based on the correlation matrix
  • the first encoding module is used to encode the inter-frame motion features into the code stream.
  • the third determination module is used for:
  • the correlation matrix, the current image and the reference image are input into the motion coding network to obtain inter-frame motion features.
  • the third determination module is used for:
  • the motion feature is input into the motion decoding network to obtain the reconstructed motion feature.
  • based on the reconstructed motion feature, the reference feature is transformed to re-determine the predicted feature, and the correlation matrix of the predicted feature relative to the current feature is re-determined;
  • the process returns to the step of inputting the correlation matrix, the predicted features and the current features into the motion coding network to obtain the motion features;
  • if the number of iterations is equal to the iteration number threshold, the motion feature is determined to be an inter-frame motion feature.
  • the device also includes:
  • the fourth determination module is used to determine residual features based on the inter-frame motion features
  • the second encoding module is used to encode the residual features into the code stream.
  • the fourth determination module is used for:
  • the reference features are transformed to obtain the prediction features of the current image
  • the first residual is input into the residual coding network to obtain the residual feature.
  • the fourth determination module is used for:
  • the reference image is transformed to obtain the predicted image
  • the second residual is input into the residual coding network to obtain the residual feature.
  • the reference image is a reconstructed image of the reference frame.
  • a third aspect provides an encoding device, which includes a processor and an interface circuit.
  • the processor receives and/or sends data through the interface circuit, and the processor is configured to call program instructions stored in the memory to execute the encoding method provided in the first aspect.
  • the encoding device includes the memory.
  • the processor is used to determine current features and reference features.
  • the current features are features of the current image to be encoded.
  • the reference features are features of the reference image of the current image. The processor is also used to determine a correlation matrix of the reference feature relative to the current feature, determine inter-frame motion features based on the correlation matrix, and encode the inter-frame motion features into a code stream.
  • a fourth aspect provides a computer device, which includes a processor and a memory.
  • the memory is used to store a program for executing the encoding method provided in the above first aspect.
  • the processor is configured to execute a program stored in the memory.
  • the computer device may also include a communication bus for establishing a connection between the processor and the memory.
  • a fifth aspect provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the encoding method described in the first aspect.
  • a sixth aspect provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the encoding method described in the first aspect.
  • Figure 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of another implementation environment provided by the embodiment of the present application.
  • Figure 3 is a flow chart of an encoding method provided by an embodiment of the present application.
  • Figure 4 is a comparison diagram of reconstructed motion features provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a coding network provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a decoding network provided by an embodiment of the present application.
  • Figure 7 is a flow chart of another encoding method provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an entropy estimation network provided by an embodiment of the present application.
  • Figure 9 is a flow chart of a coding and decoding method provided by an embodiment of the present application.
  • Figure 10 is a partial flow chart of an encoding method provided by an embodiment of the present application.
  • Figure 11 is a comparison diagram of another reconstructed motion feature provided by an embodiment of the present application.
  • Figure 12 is a coding and decoding performance comparison chart provided by the embodiment of the present application.
  • Figure 13 is another encoding and decoding performance comparison chart provided by an embodiment of the present application.
  • Figure 14 is another encoding and decoding performance comparison chart provided by an embodiment of the present application.
  • Figure 15 is another encoding and decoding performance comparison chart provided by an embodiment of the present application.
  • Figure 16 is a flow chart of a decoding method provided by an embodiment of the present application.
  • Figure 17 is a schematic structural diagram of an encoding device provided by an embodiment of the present application.
  • Figure 18 is a schematic block diagram of a coding and decoding device provided by an embodiment of the present application.
  • Pixel depth Also known as bits/pixel, BPP is the number of bits used to store each pixel. The smaller the BPP, the smaller the compression code rate.
  • Code rate In image compression, it refers to the encoding length required for unit pixel encoding. The higher the code rate, the better the image reconstruction quality.
  • PSNR Peak signal to noise ratio
  • MS-SSIM Multi-scale structural similarity index measure
  • AI Artificial Intelligence
  • CNN Convolutional neural network
  • GOP Group of pictures, composed of I frames, P frames and B frames; the group of pictures is the basic unit of access for video image encoders and decoders.
  • I frame intra-coded frame, also called key frame. I-frames are compressed and generated without referring to other pictures. I-frames describe the details of the image background and moving subjects. During decoding, a complete image can be reconstructed using only the data of the I-frame.
  • the I frame is usually the first frame of each GOP and serves as the reference frame for random access.
  • the P frame forward predictive-coded frame, also called forward predictive frame (forward reference frame).
  • the P frame represents the difference between this frame and the previous key frame (or P frame).
  • the P frame uses motion compensation to transmit the prediction error and motion vectors between it and the previous I or P frame; during decoding, the previously cached picture is superimposed with the difference represented by this frame to generate a complete image.
  • B frame Bidirectionally predicted frame, also known as bidirectional interpolation frame and bidirectional reference frame.
  • the B frame uses the previous I frame or P frame and the following P frame as reference frames.
  • the B frame transmits the prediction error and motion vector between it and the two reference frames before and after.
  • the two reference frames before and after are combined to obtain a complete image.
  • Interframe prediction coding mainly consists of two parts. One part is the inter-frame prediction part, and the other part is the residual compression part.
  • the inter-frame prediction part includes the prediction and compression module of inter-frame side information, and the transformation module.
  • inter-frame side information is embodied as optical flow. During the encoding process, the images of the reference frame and the current frame are input into the optical flow estimation network to obtain the predicted optical flow and compress the optical flow.
  • inter-frame side information is embodied as motion features.
  • the image features of the current frame and the reference frame are extracted and input into a convolutional neural network to obtain the predicted motion features, and the motion features are compressed.
  • the transformation module usually uses the warp operation.
  • the reference frame is transformed into the prediction result of the current frame by using inter-frame side information.
  • the prediction and compression of optical flow are completely decoupled.
  • the predicted optical flow may represent the inter-frame changes between the current frame and the reference frame well, but that optical flow is not necessarily easy to compress, which affects the encoding and decoding performance.
  • the computing power requirements of the optical flow estimation network will be large, that is, the amount of calculation required to predict optical flow will be large.
  • in the other scheme, the image features of the current frame and the reference frame are input into a convolutional neural network, which relies entirely on convolution operations to fit the motion between the current frame and the reference frame. The accuracy of the resulting motion features is low, it is difficult to predict more precise motion characteristics, and the encoding and decoding performance is affected.
  • Correlation matrix In the embodiment of the present application, the correlation matrix of the reference feature relative to the current feature is determined, so that the correlation matrix is used to predict more accurate inter-frame motion features.
  • the correlation matrix is also called a cross-correlation matrix, a neighborhood cross-correlation matrix, a neighborhood correlation matrix, etc.
  • One calculation way to determine the correlation matrix of one feature relative to another feature involves: given two features F1 and F2, and the neighborhood size, calculate the neighborhood correlation matrix of feature F2 relative to feature F1.
  • the size of this neighborhood is k*k, and the dimensions of features F1 and F2 are both c*h*w, where c, h, and w are the number of channels, height, and width of the feature space respectively, and h*w represents the size of the feature space.
  • the operation of calculating the neighborhood correlation matrix includes: denote the feature vector at point (i, j) of feature F1 as F1(i, j), where i ∈ [1,h], j ∈ [1,w]; for each point (i, j), compute the correlation corr(F1(i, j), F2(i+u, j+v)) between F1(i, j) and the feature vectors of F2 in the k*k neighborhood centered at (i, j), where (u, v) ranges over the k*k neighborhood offsets, so that a neighborhood correlation matrix of dimension k*k*h*w is obtained.
  • the corr() function can be any form of distance function, such as inner product, cosine (cos), L1 distance, L2 distance and other functions, or distance functions obtained by convolution learning, etc.
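  • For illustration only, the following is a minimal sketch of the neighborhood correlation calculation described above, assuming PyTorch, an odd neighborhood size k, and the inner product as the corr() function; the function name neighborhood_correlation is hypothetical and not part of the embodiments.

```python
import torch
import torch.nn.functional as F

def neighborhood_correlation(f1: torch.Tensor, f2: torch.Tensor, k: int) -> torch.Tensor:
    """Neighborhood correlation of feature f2 relative to feature f1.

    f1, f2: tensors of shape (c, h, w); k: odd neighborhood size.
    Returns a tensor of shape (k*k, h, w), matching the k*k*h*w dimension
    described above. The inner product is used as corr(); any distance works.
    """
    c, h, w = f1.shape
    pad = k // 2
    # Gather the k*k neighborhood of every position of f2: shape (1, c*k*k, h*w)
    patches = F.unfold(f2.unsqueeze(0), kernel_size=k, padding=pad)
    patches = patches.view(c, k * k, h, w)
    # Inner product between F1(i, j) and each neighbor of F2 around (i, j)
    corr = (f1.unsqueeze(1) * patches).sum(dim=0)   # (k*k, h, w)
    return corr
```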
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment includes a source device 10 , a destination device 20 , a link 30 and a storage device 40 .
  • the source device 10 may generate an encoded video, that is, a code stream. Therefore, the source device 10 may also be referred to as an encoding device.
  • the destination device 20 can decode the code stream generated by the source device 10 . Therefore, the destination device 20 may also be referred to as a decoding device.
  • Link 30 may receive encoded video generated by source device 10 and may transmit the encoded video to destination device 20 .
  • the storage device 40 can receive the encoded video generated by the source device 10 and can store the encoded video.
  • the destination device 20 can directly obtain the encoded video from the storage device 40 .
  • storage device 40 may correspond to a file server or another intermediate storage device that may hold the encoded video generated by source device 10, in which case destination device 20 may obtain the encoded video from storage device 40 via streaming or download.
  • Source device 10 and destination device 20 may each include one or more processors and memory coupled to the one or more processors. The memory may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other media that can be used to store desired program code in the form of instructions or data structures accessible by a computer.
  • RAM random access memory
  • ROM read-only memory
  • EEPROM electrically erasable programmable read-only memory
  • both the source device 10 and the destination device 20 may include a mobile phone, a smartphone, a personal digital assistant (PDA), a wearable device, a pocket PC (PPC), a tablet computer, a smart car, a smart television, a smart speaker, a desktop computer, a mobile computing device, a notebook (e.g., laptop) computer, a set-top box, a telephone handset such as a so-called "smart" phone, a camera, a display device, a digital media player, a video game console, a vehicle-mounted computer, or the like.
  • PDA personal digital assistant
  • PPC pocket PC
  • Link 30 may include one or more media or devices capable of transmitting encoded video from source device 10 to destination device 20 .
  • link 30 may include one or more communication media that enables source device 10 to send encoded video directly to destination device 20 in real time.
  • the source device 10 may modulate the encoded video based on a communication standard, which may be a wireless communication protocol, etc., and may send the modulated video to the destination device 20 .
  • the one or more communication media may include wireless and/or wired communication media.
  • the one or more communication media may include a radio frequency (RF) spectrum or one or more physical transmission lines.
  • RF radio frequency
  • the one or more communication media may form part of a packet-based network, which may be a local area network, a wide area network, or a global network (e.g., the Internet), etc.
  • the one or more communication media may include routers, switches, base stations, or other equipment that facilitates communication from the source device 10 to the destination device 20, etc., which are not specifically limited in the embodiments of the present application.
  • the storage device 40 can store the received encoded video sent by the source device 10 , and the destination device 20 can directly obtain the encoded video from the storage device 40 .
  • the storage device 40 may include any of a variety of distributed or locally accessed data storage media.
  • any of the multiple distributed or locally accessed data storage media may be a hard drive, a Blu-ray Disc, a digital versatile disc (DVD), a compact disc read-only memory (CD-ROM), flash memory, volatile or non-volatile memory, or any other suitable digital storage media for storing code streams, etc.
  • the storage device 40 may correspond to a file server or another intermediate storage device that may save the code stream generated by the source device 10, and the destination device 20 may obtain the code stream from the storage device 40 via streaming or download.
  • the file server may be any type of server capable of storing the encoded video and sending the encoded video to destination device 20 .
  • the file server may include a network server, a file transfer protocol (FTP) server, a network attached storage (network attached storage, NAS) device or a local disk drive, etc.
  • Destination device 20 may obtain the encoded images over any standard data connection, including an Internet connection.
  • Any standard data connection may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., digital subscriber line (DSL), cable modem, etc.), or a combination of the two that is suitable for retrieving the encoded video stored on a file server.
  • the transmission of the encoded video from storage device 40 may be a streaming transmission, a download transmission, or a combination of both.
  • the implementation environment shown in Figure 1 is only one possible implementation. The technology of the embodiments of the present application is applicable not only to the source device 10 shown in Figure 1, which can encode video, and the destination device 20, which can decode the encoded video, but also to other devices that can encode videos and decode code streams; this is not specifically limited in the embodiments of the present application.
  • the source device 10 includes a data source 120 , an encoder 100 and an output interface 140 .
  • the output interface 140 may include a modulator/demodulator (modem) and/or a transmitter, where the transmitter may also be referred to as a sender.
  • Data source 120 may include a video capture device (e.g., a video camera, etc.), an archive containing previously captured video, a feed interface for receiving video from a video content provider, and/or a computer graphics system for producing the video, or a combination of these video sources.
  • the data source 120 may send a video to the encoder 100, and the encoder 100 may encode the video sent by the data source 120 to obtain an encoded video.
  • the encoder can send the encoded video to the output interface.
  • source device 10 sends the encoded video directly to destination device 20 via output interface 140 .
  • the encoded video may also be stored on storage device 40 for later retrieval by destination device 20 for decoding and/or display.
  • the destination device 20 includes an input interface 240 , a decoder 200 and a display device 220 .
  • input interface 240 includes a receiver and/or modem.
  • the input interface 240 may receive the encoded video via the link 30 and/or from the storage device 40 and then send it to the decoder 200.
  • the decoder 200 may decode the received encoded video to obtain the decoded video.
  • the decoder may send the decoded video to display device 220 .
  • Display device 220 may be integrated with destination device 20 or may be external to destination device 20 . Generally, display device 220 displays the decoded video.
  • Display device 220 may be any of a variety of types of display devices; for example, the display device 220 may be a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or another type of display device.
  • LCD liquid crystal display
  • plasma display a plasma display
  • OLED organic light-emitting diode
  • encoder 100 and decoder 200 may each be integrated with an audio encoder and an audio decoder, respectively, and may include an appropriate multiplexer-demultiplexer (MUX-DEMUX) unit or other hardware and software for encoding both audio and video in a common data stream or in separate data streams.
  • the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol, or other protocols such as user datagram protocol (UDP), if applicable.
  • the encoder 100 and the decoder 200 may each be any of the following circuits: one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the technology of the embodiments of the present application is implemented partially in software, the device may store instructions for the software in a suitable non-volatile computer-readable storage medium, and may use one or more processors to execute the instructions in hardware to implement the technology of the embodiments of the present application. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered one or more processors. Each of the encoder 100 and the decoder 200 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec) in the respective device.
  • Embodiments of the present application may generally refer to encoder 100 as “signaling” or “sending” certain information to another device, such as decoder 200.
  • the term "signaling" or "sending" may generally refer to the transmission of syntax elements and/or other data used to decode the compressed video. This transfer can occur in real time or near real time. Alternatively, this communication may occur over a period of time, for example when, during encoding, the syntax elements are stored in the encoded bitstream on a computer-readable storage medium; the decoding device may then retrieve the syntax elements at any time after they have been stored to such media.
  • FIG. 2 is a schematic diagram of another implementation environment provided by the embodiment of the present application.
  • the implementation environment includes an encoding end and a decoding end.
  • the encoding end includes an AI encoding module, an entropy encoding module and a file sending module.
  • the decoding end includes a file loading module, an entropy decoding module and an AI decoding module.
  • the inter-frame motion features and residual features to be encoded are obtained through the AI encoding unit, and the inter-frame motion features and residual features are entropy-encoded to obtain the code stream, that is, the compressed video file.
  • the encoding end saves the compressed file.
  • the compressed file is transmitted to the decoding end, which loads the compressed file and obtains the decompressed video through entropy decoding and AI decoding units.
  • the AI encoding unit includes one or more of the following: an image feature extraction network, a motion coding network, a residual coding network, an entropy estimation network, a motion decoding network, or a residual decoding network.
  • the AI decoding unit includes one or more of the following: a motion decoding network, a residual decoding network, or an entropy estimation network.
  • the data processing process of the AI encoding unit and AI decoding unit is implemented on the embedded neural network processor (neural network processing unit, NPU) to improve data processing efficiency.
  • NPU neural network processing unit
  • the processes of entropy encoding, file saving and file loading are implemented on the central processing unit (CPU).
  • the encoding end and the decoding end are one device, or the encoding end and the decoding end are two independent devices. That is, for a device, the device has both a video compression function and a video decompression function, or the device has a video compression function or a video decompression function.
  • any of the encoding methods below can be executed by the encoding end. Any of the decoding methods below can be performed by the decoding end.
  • Figure 3 is a flow chart of an encoding method provided by an embodiment of the present application, and the method is applied to the encoding end. Please refer to Figure 3.
  • the method includes the following steps.
  • Step 301 Determine the current feature and the reference feature.
  • the current feature is the feature of the current image to be encoded
  • the reference feature is the feature of the reference image of the current image.
  • the current image to be encoded corresponds to a reference image.
  • a P frame image corresponds to a reference image
  • the reference image is an I frame or P frame image before the P frame.
  • a B frame image corresponds to two reference images, namely an I frame or P frame image before the B frame, and a P frame image after the B frame.
  • one implementation method for the encoding end to determine the current features and reference features is to input the current image into the image feature extraction network to obtain the current features, and input the reference image into the image feature extraction network to obtain the reference features.
  • the current feature is the feature of the current image to be encoded
  • the reference feature is the feature of the reference image of the current image.
  • the encoding end can also extract image features through other implementation methods, such as principal component analysis, statistics-based methods, etc.
  • the image feature extraction network in the embodiment of the present application is pre-trained, and the network structure and training method of the image feature extraction network are not limited in the embodiment of the present application.
  • the image feature extraction network can be based on a network built by a fully connected network or a convolutional neural network.
  • the convolution in the convolutional neural network can be 2D convolution or 3D convolution.
  • the embodiments of the present application do not limit the number of network layers included in the image feature extraction network and the number of nodes in each layer.
  • the image feature extraction network is a network built based on Resblock.
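  • As an illustrative sketch only (assuming PyTorch; the class names and layer counts below are hypothetical examples, not the configuration used by the embodiments), an image feature extraction network built from residual blocks could look like:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A simple residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ImageFeatureExtractor(nn.Module):
    """Maps an image of shape (3, H, W) to a feature map of shape (c1, H/2, W/2)."""
    def __init__(self, c1: int = 64, num_blocks: int = 3):
        super().__init__()
        self.stem = nn.Conv2d(3, c1, kernel_size=5, stride=2, padding=2)
        self.blocks = nn.Sequential(*[ResBlock(c1) for _ in range(num_blocks)])

    def forward(self, x):
        return self.blocks(self.stem(x))
```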
  • the reference image is a reconstructed image of the reference frame.
  • the reference frame is the reference frame of the current frame to be encoded
  • the original image of the current frame to be encoded is the current image.
  • the reconstructed image of the reference frame is an image obtained by compressing the original image of the reference frame and then decompressing it according to the encoding method provided by the embodiment of the present application.
  • the reference image is the original image of the reference frame.
  • Step 302 Determine the correlation matrix of the reference feature relative to the current feature.
  • a correlation matrix is introduced in the solution.
  • the encoding end determines the correlation matrix of the reference feature relative to the current feature.
  • for the calculation method of the correlation matrix, please refer to the introduction above.
  • Step 303 Based on the correlation matrix, determine inter-frame motion characteristics.
  • the encoding end inputs the correlation matrix, current features and reference features into the motion coding network to obtain inter-frame motion features.
  • the encoding end inputs the correlation matrix, the current image and the reference image into the motion coding network to obtain inter-frame motion features.
  • the encoding end inputs the correlation matrix into the motion coding network to obtain inter-frame motion features.
  • the encoding end uses the reference features as prediction features, and inputs the correlation matrix, prediction features, and current features into the motion coding network to obtain motion features.
  • the encoding end determines the number of iterations. If the number of iterations is less than the iteration number threshold, the encoding end inputs the motion features into the motion decoding network to obtain reconstructed motion features, transforms the reference features based on the reconstructed motion features to re-determine the predicted features, re-determines the correlation matrix of the predicted features relative to the current features, and returns to the step of inputting the correlation matrix, the predicted features and the current features into the motion coding network to obtain the motion features. If the number of iterations is equal to the iteration number threshold, the encoding end determines the motion features as the inter-frame motion features.
  • when the iteration starts, the number of iterations is equal to the initial value. Optionally, the initial value is 0 and the iteration number threshold is K-1, or the initial value is 1 and the iteration number threshold is K.
  • K is a positive integer greater than or equal to 1
  • K represents the total number of iteration processes.
  • for example, the total number of iterative processes is K, the initial value is 1, and the iteration number threshold is K.
  • assume that the current image x_t to be encoded is the t-th frame in the video (t>0), the reference image of the current image x_t is the reconstructed image of the reference frame, the current feature is F_t, the reference feature is the feature of that reconstructed reference image, and the correlation matrix of the reference feature relative to the current feature F_t is C_t.
  • the encoding end uses the reference image as the predicted image, and inputs the correlation matrix, the predicted image, and the current image into the motion coding network to obtain motion features.
  • the encoding end determines the number of iterations. If the number of iterations is less than the iteration number threshold, the encoding end inputs the motion features into the motion decoding network to obtain reconstructed motion features, transforms the reference image based on the reconstructed motion features to re-determine the predicted image, determines the predicted feature (that is, the feature of the predicted image), re-determines the correlation matrix of the predicted feature relative to the current feature, and returns to the step of inputting the correlation matrix, the predicted image and the current image into the motion coding network to obtain the motion features. If the number of iterations is equal to the iteration number threshold, the encoding end determines the motion features as the inter-frame motion features.
  • the encoding end improves the prediction accuracy of inter-frame motion features through multiple iterations. In other words, the iterative updates of the motion features enrich their details.
  • the encoding end determines the inter-frame motion characteristics through one iteration, which can save encoding and decoding time.
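  • For illustration only, the iterative procedure described above can be sketched as follows (assuming an initial iteration count of 1 and a threshold of K; motion_enc, motion_dec, warp and correlation are hypothetical callables standing in for the motion coding network, motion decoding network, transformation operation and correlation matrix computation):

```python
def estimate_inter_frame_motion(current_feat, ref_feat, motion_enc, motion_dec,
                                warp, correlation, num_iters: int):
    """Iterative motion-feature estimation in feature space (K = num_iters)."""
    pred_feat = ref_feat                        # first prediction feature is the reference feature
    for i in range(1, num_iters + 1):           # initial value 1, threshold K
        corr = correlation(pred_feat, current_feat)          # correlation matrix C_t
        motion_feat = motion_enc(corr, pred_feat, current_feat)
        if i == num_iters:                      # iteration count reached the threshold
            return motion_feat                  # inter-frame motion features to be encoded
        recon_motion = motion_dec(motion_feat)               # reconstructed motion features
        pred_feat = warp(ref_feat, recon_motion)             # re-determine the prediction feature
```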
  • FIG. 4 is a comparison diagram of reconstructed motion features provided by an embodiment of the present application.
  • Figure 4 is about the reconstructed motion features of the same image in a video.
  • the first column is the reconstructed motion features obtained after the first iterative process
  • the second column is the reconstructed motion features obtained after the second iterative process.
  • in Figure 4, an ellipse is used to circle part of the contrast area. It can be seen that the edge structure in the second column is clearer and more accurate; in other words, compared with the reconstructed motion features in the first column, the reconstructed motion features in the second column contain more obvious detail information.
  • the motion coding network and the motion decoding network in the embodiment of the present application are pre-trained.
  • the network structure and training method of the motion coding network and the motion decoding network are not limited in the embodiment of the present application.
  • both the motion encoding network and the motion decoding network can be fully connected networks or convolutional neural networks, and the convolutions in the convolutional neural network can be 2D convolutions or 3D convolutions.
  • the embodiment of the present application analyzes the number of network layers and each network layer included in the motion coding network and the motion decoding network. The number of nodes in a layer is not limited either.
  • FIG. 5 is a schematic structural diagram of a coding network provided by an embodiment of the present application.
  • the encoding network may be a motion encoding network.
  • the encoding network is a convolutional neural network, which includes four convolutional layers (Conv) and three cascaded generalized divisive normalization (GDN) layers.
  • the convolution kernel size of each convolutional layer is 5×5
  • the number of channels of the output feature map is M
  • each convolutional layer downsamples the width and height by 2 times.
  • the structure of the coding network shown in Figure 5 is not intended to limit the embodiments of the present application; for example, the convolution kernel size, the number of channels of the feature map, the downsampling factor, the number of downsampling operations, the number of convolution layers, etc. can all be adjusted.
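  • As an illustrative sketch of the structure described for Figure 5 (assuming PyTorch; the GDN layer below is a simplified generalized divisive normalization written out by hand rather than a library implementation, and the function name is hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """Simplified GDN: y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)."""
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))
        self.eps = eps

    def forward(self, x):
        c = x.size(1)
        gamma = self.gamma.abs().view(c, c, 1, 1)        # non-negative mixing weights
        norm = F.conv2d(x * x, gamma, self.beta.abs())   # beta_i + sum_j gamma_ij * x_j^2
        return x / torch.sqrt(norm + self.eps)

def make_encoding_network(in_channels: int, m: int) -> nn.Sequential:
    """Four 5x5 conv layers, each halving width and height, with GDN in between."""
    layers, c = [], in_channels
    for i in range(4):
        layers.append(nn.Conv2d(c, m, kernel_size=5, stride=2, padding=2))
        if i < 3:                    # three GDN layers between the four convolutions
            layers.append(GDN(m))
        c = m
    return nn.Sequential(*layers)
```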
  • FIG 6 is a schematic structural diagram of a decoding network provided by an embodiment of the present application.
  • the decoding network may be a motion decoding network.
  • the decoding network is a convolutional neural network, which includes four convolutional layers (Conv) and three cascaded generalized divisive normalization (GDN) layers. The convolution kernel size of each convolution layer is 5×5, the number of channels of the output feature map is M or N, and each convolution layer upsamples the width and height by a factor of 2.
  • the structure of the decoding network shown in Figure 6 is not intended to limit the embodiments of the present application; for example, the convolution kernel size, the number of channels of the feature map, the sampling factor, the number of sampling operations, the number of convolution layers, etc. can all be adjusted.
  • Step 304 Encode the inter-frame motion features into the code stream.
  • the encoding end encodes the inter-frame motion features into the code stream, so that the subsequent decoding end can decompress the video based on the inter-frame motion features in the code stream.
  • the encoding end encodes the inter-frame motion features into the code stream through entropy coding.
  • the encoding end encodes the inter-frame motion features into the code stream through entropy coding according to the specified first probability distribution parameter.
  • the encoding end inputs the inter-frame motion feature into a super-coding network (which may also be called a super-prior network) to obtain the first super-prior feature.
  • the encoding end encodes the first super-a priori feature into the code stream through entropy coding according to the specified second probability distribution parameter.
  • the encoding end inputs the first super a priori feature (the first super a priori feature parsed from the code stream or the first super a priori feature obtained through the super encoding network) into the super decoding network to obtain the first a priori feature .
  • the encoding end determines the probability distribution parameter of the inter-frame motion feature based on the first a priori feature, and encodes the inter-frame motion feature into the code stream through entropy coding based on the probability distribution parameter of the inter-frame motion feature. It should be noted that the encoding end encodes the first super a priori feature into the code stream so that the decoder can parse the inter-frame motion features from the code stream based on the first super a priori feature.
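  • For illustration only, the hyper-prior flow described above can be sketched as follows (hyper_enc, hyper_dec, quantize and entropy_encode are hypothetical stand-ins for the super-encoding network, super-decoding network, quantization and the actual entropy coder; the Gaussian mean/scale split is an assumption of this sketch):

```python
def encode_motion_features(motion_feat, hyper_enc, hyper_dec, quantize, entropy_encode,
                           second_prob_params):
    """Encode the inter-frame motion features into the code stream using a hyper-prior."""
    # First super-prior feature, coded with the specified second probability distribution parameters.
    hyper_feat = quantize(hyper_enc(motion_feat))
    hyper_bits = entropy_encode(hyper_feat, second_prob_params)

    # First prior feature -> probability distribution parameters of the motion features.
    prior = hyper_dec(hyper_feat)
    mean, scale = prior.chunk(2, dim=1)          # e.g. a Gaussian mean/scale split (assumption)
    motion_bits = entropy_encode(quantize(motion_feat), (mean, scale))
    return hyper_bits + motion_bits              # both parts are written into the code stream
```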
  • the specified first probability distribution parameter and the second probability distribution parameter are probability distribution parameters determined in advance through the corresponding probability distribution estimation network.
  • the embodiments of the present application do not limit the network structure or training method of the probability distribution estimation network used.
  • the network structure of the probability distribution estimation network can be a fully connected network or CNN.
  • the embodiments of the present application do not limit the number of layers included in the network structure of the probability distribution estimation network and the number of nodes in each layer.
  • the encoding end also determines and encodes residual features, so that the decoding end can decompress the video based on the inter-frame motion features and the residual features.
  • the encoding method provided by the embodiment of the present application also includes the following steps 305 and 306.
  • Step 305 Determine residual features based on the inter-frame motion features.
  • the encoding end determines the residual feature based on the inter-frame motion feature.
  • the encoding end inputs the inter-frame motion features into a motion decoding network to obtain reconstructed motion features between the current image and the reference image.
  • the encoding end transforms the reference features based on the reconstructed motion features between the current image and the reference image to obtain the prediction features of the current image.
  • the encoding end determines the first residual, which is the residual between the predicted feature of the current image and the current feature, and inputs the first residual into the residual coding network to obtain the residual feature. That is to say, the encoding side performs transformation and prediction in the feature space.
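  • For illustration only, the feature-space variant just described can be sketched as follows (motion_dec, warp and residual_enc denote the motion decoding network, the transformation operation and the residual coding network; the names are hypothetical):

```python
def compute_residual_features(inter_motion_feat, current_feat, ref_feat,
                              motion_dec, warp, residual_enc):
    """Feature-space variant: decode the motion, warp the reference feature,
    take the residual against the current feature, then encode it."""
    recon_motion = motion_dec(inter_motion_feat)       # reconstructed motion features
    pred_feat = warp(ref_feat, recon_motion)           # prediction features of the current image
    first_residual = current_feat - pred_feat          # first residual
    return residual_enc(first_residual)                # residual features to be encoded
```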
  • the encoding end inputs inter-frame motion features into the motion decoding network to obtain reconstructed motion features between the current image and the reference image.
  • the encoding end transforms the reference image based on the reconstructed motion features between the current image and the reference image to obtain the predicted image.
  • the encoding end determines the second residual, which is the residual between the predicted image and the current image, and inputs the second residual into the residual coding network to obtain residual features.
  • the predicted image is the predicted image of the current image. That is, the encoding side performs transformation and prediction in image space.
  • the motion decoding network in step 305 is the same as the motion decoding network in step 303.
  • the residual coding network in step 305 is pre-trained.
  • the network structure and training method of the residual coding network are not limited in the embodiments of this application.
  • the residual coding network can be a fully connected network or a convolutional neural network, and the convolution in the convolutional neural network can be 2D convolution or 3D convolution.
  • the embodiments of the present application do not limit the number of network layers included in the residual coding network and the number of nodes in each layer.
  • the network structure of the residual coding network is also as shown in Figure 5.
  • Step 306 Encode the residual features into the code stream.
  • the encoding end encodes the residual features into the code stream, so that the subsequent decoding end can decompress the video based on the inter-frame motion features and residual features in the code stream.
  • the encoding end encodes the residual features into the code stream through entropy coding.
  • the encoding end encodes the residual feature into the code stream through entropy coding according to the specified third probability distribution parameter.
  • the encoding end inputs the residual feature into the super-encoding network to obtain the second super-prior feature.
  • the encoding end encodes the second super-a priori feature into the code stream through entropy coding according to the specified fourth probability distribution parameter.
  • the encoding end inputs the second super a priori feature (the second super a priori feature parsed from the code stream or the second super a priori feature obtained through the super encoding network) into the super decoding network to obtain the second a priori feature .
  • the encoding end determines the probability distribution parameter of the residual feature based on the second a priori feature, and encodes the residual feature into the code stream through entropy coding based on the probability distribution parameter of the residual feature. It should be noted that the encoding end encodes the second super a priori feature into the code stream so that the decoder can parse the residual feature from the code stream based on the second super a priori feature.
  • the specified third probability distribution parameter and the fourth probability distribution parameter are probability distribution parameters determined in advance through the corresponding probability distribution estimation network.
  • the embodiments of the present application do not limit the network structure or training method of the probability distribution estimation network used.
  • the network structure of the probability distribution estimation network can be a fully connected network or CNN.
  • the embodiments of the present application do not limit the number of layers included in the network structure of the probability distribution estimation network and the number of nodes in each layer.
  • the super-encoding network used to encode the residual features is the same as or different from the super-encoding network used to encode the inter-frame motion features, and the super-decoding network used to encode the residual features is the same as or different from the super-decoding network used to encode the inter-frame motion features.
  • if any of the above probability distribution estimation networks is modeled using a single Gaussian model or a Gaussian mixture model, the estimated probability distribution parameters include the mean and the variance. For example, assuming the residual features whose probability distribution parameters are to be estimated conform to a single Gaussian model or a Gaussian mixture model, the probability distribution parameters of the residual features obtained by the probability distribution estimation network include the mean and the variance. If any of the above probability distribution estimation networks is modeled using a Laplace distribution model, the estimated probability distribution parameters include a location parameter and a scale parameter. If any of the above probability distribution estimation networks is modeled using a logistic distribution model, the estimated probability distribution parameters include a mean and a scale parameter.
  • the probability distribution estimation network in the embodiment of the present application can also be called a factor entropy model.
  • the probability distribution estimation network is a part of the entropy estimation network.
  • the entropy estimation network also includes the above-mentioned super-encoding network and super-decoding network.
  • the super-coding network, probability distribution estimation network and super-decoding network used to encode inter-frame motion features form part or all of an entropy estimation network
  • the super-coding network, probability distribution estimation network and super-decoding network used to encode the residual features form part or all of another entropy estimation network.
  • FIG. 8 is a schematic structural diagram of an entropy estimation network provided by an embodiment of the present application.
  • This entropy estimation network can be any of the above An entropy estimation network.
  • the entropy estimation network includes a hyper encoder (HyEnc) network, a factor entropy model and a hyper decoder (HyDec) network.
  • the super-encoding network includes three convolutional layers (Conv) and two activation layers interspersed in cascade (such as activation layers built based on ReLU or other activation functions). The convolution kernel size of each convolution layer is 5×5, and the number of channels of the output feature map is M. The first two convolution layers downsample the width and height by a factor of 2, and the last convolution layer does not downsample.
  • the network structure of the factor entropy model is the network structure of the probability distribution estimation network introduced previously.
  • the super-decoding network includes three convolutional layers (Conv) and two activation layers interspersed in cascade (such as activation layers built based on ReLU or other activation functions). The convolution kernel size of each convolution layer is 5×5, and the number of channels of the output feature map is M. The first convolution layer does not perform upsampling, and the last two convolution layers upsample the width and height by a factor of 2.
  • the structure of the entropy estimation network shown in Figure 8 is not intended to limit the embodiments of the present application; for example, the convolution kernel size, the number of channels of the feature map, the downsampling factor, the number of downsampling operations, the upsampling factor, the number of upsampling operations, the number of convolution layers, etc. can all be adjusted.
  • the encoding end first obtains the residual (the first residual or the second residual), then obtains the residual features, and then encodes the residual features, which is equivalent to compressing the residual.
  • the encoding end may directly encode the residual into the code stream, that is, the residual may not be compressed.
  • Figure 9 is a flow chart of a video encoding and decoding method provided by an embodiment of the present application.
  • assume that the current image x_t to be encoded is the t-th frame in the video (t>0), and the reference image of the current image x_t is the reconstructed image of the reference frame. The encoding process for the current image x_t includes the following steps 901 to 910.
  • Step 901 Input the current image x_t and the reference image into the image feature extraction network respectively to obtain the current feature F_t and the reference feature. The dimensions of the current feature F_t and the reference feature are both c1*h1*w1.
  • Step 902 Calculate reference features Correlation matrix C t relative to current feature F t .
  • the dimension of the correlation matrix C t is k*k*h1*w1.
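
The neighbourhood correlation of Step 902 can be computed without any learned parameters. The sketch below (assumed PyTorch; the inner product is used as the distance function and zero padding at the borders is an illustrative choice, the function name is ours) produces the k*k*h1*w1 volume for a batch of feature maps.

    import torch
    import torch.nn.functional as F

    def neighborhood_correlation(f_ref: torch.Tensor, f_cur: torch.Tensor, k: int = 9) -> torch.Tensor:
        """Correlation matrix of the reference feature relative to the current feature.

        f_ref, f_cur: (B, c1, h1, w1) feature maps of the reference and current images.
        Returns a (B, k*k, h1, w1) volume: for every position of f_cur, the inner
        products of its feature vector with the k*k neighbourhood of f_ref centred
        at the same position (zero padding at the borders).
        """
        b, c, h, w = f_cur.shape
        pad = k // 2
        # unfold gathers, for every spatial position, the k*k neighbourhood of f_ref
        ref_patches = F.unfold(f_ref, kernel_size=k, padding=pad)      # (B, c*k*k, h*w)
        ref_patches = ref_patches.view(b, c, k * k, h * w)
        cur_vectors = f_cur.view(b, c, 1, h * w)
        corr = (ref_patches * cur_vectors).sum(dim=1)                  # inner product over channels
        return corr.view(b, k * k, h, w)

Other distance functions mentioned in the description (cosine, L1, L2, or a learned distance) would only change the line that combines ref_patches and cur_vectors.
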
  • Step 903: Input the correlation matrix C_t, the current feature F_t and the reference feature into the motion encoding network to obtain the inter-frame motion feature.
  • this inter-frame motion feature is the motion feature to be encoded.
  • the dimensions of the inter-frame motion feature are c1*h2*w2, where usually h2 < h1 and w2 < w1.
  • Step 904: Determine, through the entropy estimation network, the probability distribution parameters corresponding to each element of the inter-frame motion feature, such as the mean μ_{m,t} and the variance σ_{m,t}.
  • Step 905: Based on the probability distribution parameters corresponding to each element of the inter-frame motion feature, encode the inter-frame motion feature into the code stream through entropy coding.
  • optionally, the code stream is a bit stream.
  • the bit sequence obtained by entropy coding the inter-frame motion feature is a partial bit sequence included in the code stream; this part of the bit sequence is called the motion information code stream or the motion information bit stream.
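
Steps 904 and 905 tie the entropy estimation network to the entropy coder: the mean and variance predicted for each latent element define the probability mass of its quantized value, and that probability both drives the coder and gives the estimated length of the motion information bit stream. A minimal sketch, assuming a single-Gaussian entropy model as mentioned earlier in the description (the function names are ours):

    import torch

    def gaussian_likelihood(y_hat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor,
                            eps: float = 1e-9) -> torch.Tensor:
        """Probability mass of each quantized element under N(mu, sigma^2).

        The Gaussian is integrated over the quantization bin [y_hat - 0.5, y_hat + 0.5];
        this is the per-element probability handed to the entropy coder in Step 905.
        """
        dist = torch.distributions.Normal(mu, sigma)
        p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
        return p.clamp_min(eps)

    def estimated_bits(y_hat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        """-log2 of the probabilities summed over all elements: the estimated size, in bits,
        of the corresponding bit stream (usable as the rate term during training)."""
        return (-torch.log2(gaussian_likelihood(y_hat, mu, sigma))).sum()
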
  • Step 906: Input the inter-frame motion feature into the motion decoding network to obtain the reconstructed motion feature M_t.
  • optionally, the dimension of the reconstructed motion feature M_t is c2*h1*w1, where usually c2 < c1.
  • in some embodiments, c2 may be greater than or equal to c1.
  • Step 907: Use the reconstructed motion feature M_t to transform the reference feature into the prediction feature. The dimensions of the prediction feature are c1*h1*w1.
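
The document does not pin down how the reconstructed motion feature M_t transforms the reference feature in Step 907; warping is one common choice. The sketch below assumes, purely for illustration, that two channels of M_t are read as a per-pixel (dx, dy) offset field and applies a bilinear warp; a learned transform operating on all c2 channels would also fit the description.

    import torch
    import torch.nn.functional as F

    def warp_features(f_ref: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        """Bilinear warp of the reference feature by a dense 2-channel offset field.

        f_ref: (B, c1, h1, w1) reference feature; offsets: (B, 2, h1, w1) assumed
        per-pixel (dx, dy) displacements taken from the reconstructed motion feature.
        """
        b, _, h, w = f_ref.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=f_ref.device),
                                torch.arange(w, device=f_ref.device), indexing="ij")
        grid_x = xs.unsqueeze(0) + offsets[:, 0]        # sampling positions in pixels
        grid_y = ys.unsqueeze(0) + offsets[:, 1]
        grid = torch.stack((2 * grid_x / (w - 1) - 1,   # grid_sample expects [-1, 1] coordinates
                            2 * grid_y / (h - 1) - 1), dim=-1)
        return F.grid_sample(f_ref, grid, mode="bilinear", align_corners=True)
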
  • Step 908: Calculate the residual between the current feature F_t and the prediction feature, and input this residual into the residual coding network to obtain the residual feature.
  • this residual feature is the residual feature to be encoded.
  • Step 909: Determine, through the entropy estimation network, the probability distribution parameters corresponding to each element of the residual feature, such as the mean μ_{r,t} and the variance σ_{r,t}.
  • Step 910: Based on the probability distribution parameters corresponding to each element of the residual feature, encode the residual feature into the code stream through entropy coding. Assuming the code stream is a bit stream, the bit sequence obtained by entropy coding the residual feature is a partial bit sequence included in the code stream; this part of the bit sequence is called the residual information code stream or the residual information bit stream.
  • if a subsequent encoding process needs to use the current image x_t as the reference image of another image to be encoded, the encoding end also inputs the residual feature into the residual decoding network to obtain the reconstructed residual.
  • based on the prediction feature and the reconstructed residual, the encoding end obtains the reconstruction feature of the current image x_t, and inputs the reconstruction feature of the current image x_t into the image reconstruction network to obtain the reconstructed image of the current frame.
  • the image reconstruction network is not shown in Figure 9.
  • the image reconstruction network may be a deconvolution network, and the image reconstruction network may match the above image feature extraction network.
  • the residual decoding network is pre-trained, and the network structure and training method of the residual decoding network are not limited in the embodiments of this application.
  • the residual decoding network can be a fully connected network or a convolutional neural network.
  • the embodiments of the present application do not limit the number of network layers included in the residual decoding network and the number of nodes in each layer.
  • the network structure of the residual decoding network can be as shown in Figure 6.
  • the above step 903 can be replaced by: inputting the correlation matrix C_t, the current image x_t and the reference image into the motion encoding network to obtain the inter-frame motion feature; another embodiment is thus obtained.
  • the above step 907 can also be replaced by: using the reconstructed motion feature M_t to transform the reference image into a predicted image; and step 908 can be replaced by: calculating the residual between the current image x_t and the predicted image, and inputting this residual into the residual coding network to obtain the residual feature.
  • in that embodiment, the reconstructed image of the current frame is subsequently obtained as well.
  • the difference from the steps introduced in the previous paragraph is that the reconstructed image of the current frame is obtained based on the predicted image and the reconstructed residual.
  • in the flow shown in Figure 9, the inter-frame motion features are not obtained through multiple iterations. To reconstruct motion features with richer details, the iterative process introduced in step 303 above (shown in Figure 10 below) can be applied to the video encoding and decoding method of Figure 9, so that more accurate inter-frame motion features are obtained through multiple iterations.
  • Figure 10 is a partial flow chart of an encoding method provided by an embodiment of the present application.
  • the total number of iterative processes is K
  • the initial value is 0
  • the iteration number threshold is K-1
  • the current image x_t to be encoded is the t-th frame in the video, t > 0
  • the reference image of the current image x_t is the reconstructed image of the reference frame
  • the encoding process of encoding the current image x_t includes the following steps 1001 to 1015.
  • Step 1001: Input the current image x_t and the reference image into the image feature extraction network respectively, to obtain the current feature F_t and the reference feature.
  • Step 1002: Calculate the correlation matrix C_t of the reference feature relative to the current feature F_t.
  • Step 1003: Take the reference feature as the prediction feature of the first iteration, and take the correlation matrix C_t as the correlation matrix of the prediction feature relative to the current feature F_t.
  • Step 1004: Input the correlation matrix C_t, the prediction feature and the current feature F_t into the motion encoding network to obtain the motion feature, and determine the number of iterations i. The number of iterations i determined the first time is 0, and each number of iterations determined thereafter equals the previously determined number of iterations plus 1.
  • Step 1005: Determine whether i is less than K-1. If i is less than K-1, perform step 1006; if i equals K-1, take the motion feature as the inter-frame motion feature and perform step 1009.
  • Step 1006: Input the motion feature into the motion decoding network to obtain the reconstructed motion feature M_t.
  • Step 1007: Based on the reconstructed motion feature M_t, transform the reference feature to re-determine the prediction feature.
  • Step 1008: Re-determine the correlation matrix C_t of the prediction feature relative to the current feature F_t, and return to step 1004.
  • Step 1009: Determine, through the entropy estimation network, the probability distribution parameters corresponding to each element of the inter-frame motion feature.
  • Step 1010: Based on the probability distribution parameters corresponding to each element of the inter-frame motion feature, encode the inter-frame motion feature into the code stream through entropy coding.
  • Step 1011: Input the inter-frame motion feature into the motion decoding network to obtain the reconstructed motion feature M_t.
  • Step 1012: Use the reconstructed motion feature M_t to transform the reference feature into the prediction feature.
  • Step 1013: Calculate the residual between the current feature F_t and the prediction feature, and input this residual into the residual coding network to obtain the residual feature.
  • Step 1014: Determine, through the entropy estimation network, the probability distribution parameters corresponding to each element of the residual feature.
  • Step 1015: Based on the probability distribution parameters corresponding to each element of the residual feature, encode the residual feature into the code stream through entropy coding.
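
Put together, steps 1003 to 1008 form a small refinement loop around the motion encoder. The sketch below restates that loop in Python; motion_encoder, motion_decoder, correlation and warp stand for the motion encoding network, the motion decoding network, the neighbourhood correlation of step 1002 and the feature transform of step 1007, all assumed to be callables defined elsewhere.

    def refine_motion_feature(f_cur, f_ref, K, motion_encoder, motion_decoder, correlation, warp):
        """K-round motion refinement of steps 1003-1008 (Figure 10), as a sketch.

        Returns the inter-frame motion feature that steps 1009/1010 would go on to
        entropy-code; all callables are placeholders for the networks in the text.
        """
        f_pred = f_ref                          # step 1003: first prediction is the reference feature
        corr = correlation(f_pred, f_cur)       # step 1003: reuse C_t from step 1002
        for i in range(K):                      # i is the iteration counter of step 1004
            motion = motion_encoder(corr, f_pred, f_cur)   # step 1004
            if i == K - 1:                      # step 1005: iteration threshold reached
                return motion                   # this becomes the inter-frame motion feature
            m_rec = motion_decoder(motion)                 # step 1006
            f_pred = warp(f_ref, m_rec)                    # step 1007: re-predict from the reference
            corr = correlation(f_pred, f_cur)              # step 1008: re-compute the correlation
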
  • Figure 11 is a comparison diagram of another reconstructed motion feature provided by an embodiment of the present application.
  • Figure 11 is about the reconstructed motion features of the same image in a certain video.
  • the first row is the reconstructed motion features obtained by the video encoding and decoding method shown in Figure 9.
  • the second row is the reconstructed motion features obtained by the related technology.
  • This related technology inputs the image features of the current frame and the reference frame into a convolutional neural network, and completely relies on the convolution operation to fit the motion between the current frame and the reference frame. It can be seen that the edge structure of the first row is clearer and more accurate. In other words, the reconstructed motion features of the first row have more obvious detailed information than the reconstructed motion features of the second row.
  • this solution can be applied to both P frames and B frames.
  • the encoding process introduced above takes P frames as an example.
  • for a B frame, the current image corresponds to two reference images.
  • in one implementation, the encoding end obtains two inter-frame motion features and two residual features according to the aforementioned method, and encodes these two inter-frame motion features and two residual features into the code stream.
  • the two inter-frame motion features correspond to the two reference images respectively
  • the two residual features also correspond to the two reference images respectively.
  • in another implementation, the encoding end obtains two inter-frame motion features based on the two reference images according to the foregoing method, and encodes the two inter-frame motion features into the code stream. Based on these two inter-frame motion features, the encoding end transforms the reference features to obtain two prediction features respectively, fuses these two prediction features to obtain a fused prediction feature, obtains one residual feature based on the fused prediction feature and the current feature, and encodes this residual feature into the code stream.
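
For the second B-frame implementation above, the two motion-compensated predictions have to be merged before a single residual is formed. The document leaves the fusion operator open; the sketch below uses a plain average as a placeholder and allows a learned fusion network to be substituted (all argument names are ours).

    def bidirectional_prediction(f_ref0, f_ref1, m_rec0, m_rec1, warp, fuse=None):
        """Fuse the two predictions of a B frame into one fused prediction feature.

        f_ref0 / f_ref1 are the two reference features, m_rec0 / m_rec1 the two
        reconstructed motion features, and warp is the feature transform used
        elsewhere. The fusion operator is not specified in the document; an
        average is used here only as an illustration.
        """
        f_pred0 = warp(f_ref0, m_rec0)
        f_pred1 = warp(f_ref1, m_rec1)
        if fuse is not None:
            return fuse(f_pred0, f_pred1)
        return 0.5 * (f_pred0 + f_pred1)
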
  • the test set includes Class B, C and D videos from the Joint Video Experts Team (JVET) standard test set, as well as some videos from the YUV_CTC video set.
  • the resolution of Class B video is 1920*1080
  • the resolution of Class C video is 832*480
  • the resolution of Class D video is 416*240.
  • Figures 12 to 15 are coding and decoding performance comparison diagrams provided by embodiments of the present application. Among them, Figures 12 to 14 are the test results for Class B, Class C and Class D videos respectively, and Figure 15 is the test results for the videos in the YUV_CTC video set.
  • “Corr” represents this solution, which is an inter-frame prediction coding scheme based on correlation matrix.
  • “Optical flow” in the legend represents comparison scheme 1, which is an inter-frame prediction coding scheme based on optical flow estimation.
  • “FVC” in the legend represents comparison scheme 2, i.e., a coding scheme that obtains inter-frame motion features based only on the features of the current frame and the reference frame. A higher PSNR indicates better reconstructed image quality, and a larger BPP indicates a lower compression ratio. It can be seen that the curves corresponding to this solution lie further up and to the left, indicating that the encoding and decoding performance of this solution is relatively better, that is, its compression performance is better.
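
For reference, the two axes of Figures 12 to 15 are computed as follows (a small worked example; the numbers are illustrative and not taken from the reported tests):

    import math

    def psnr(mse: float, max_value: float = 255.0) -> float:
        """Peak signal-to-noise ratio in dB; higher means better reconstruction quality."""
        return 10.0 * math.log10(max_value ** 2 / mse)

    def bpp(total_bits: int, height: int, width: int) -> float:
        """Bits per pixel of the compressed frame; lower means a stronger compression ratio."""
        return total_bits / (height * width)

    # Illustrative example: a 1920x1080 Class B frame compressed to 50,000 bits at MSE 4.0
    # gives bpp(50_000, 1080, 1920) ~ 0.024 and psnr(4.0) ~ 42.1 dB.
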
  • this solution does not rely on optical flow, which removes the computation required for optical flow estimation; inter-frame motion features become easier to predict, and it becomes easier to predict inter-frame motion features that are more amenable to compression.
  • This solution introduces a correlation matrix, which helps improve the prediction accuracy of inter-frame motion features, makes inter-frame prediction and compression easier, and helps improve the fitting and generalization capabilities of the entire encoding and decoding model.
  • the inter-frame motion is fitted by introducing a correlation matrix. Since the correlation matrix can characterize the parts with stronger and weaker correlation between the current feature and the reference feature, and the parts with stronger correlation correspond to richer inter-frame motion information, in the process of fitting the inter-frame motion, based on the size of each element in the correlation matrix, the inter-frame motion corresponding to the strongly correlated parts can be fitted better, while less attention is paid to the inter-frame motion corresponding to the weakly correlated parts.
  • the correlation matrix has an information enhancement effect on the prediction of inter-frame motion features, that is, it can improve the prediction accuracy of inter-frame motion features, thereby improving compression performance.
  • Figure 16 is a flow chart of a decoding method provided by an embodiment of the present application. This method is applied to the decoder side. Referring to Figure 16, the method includes the following steps.
  • Step 1601 Parse inter-frame motion features and residual features from the code stream.
  • the inter-frame motion features encoded into the code stream are determined based on the correlation matrix of the reference features relative to the current features.
  • the decoder parses the inter-frame motion features from the code stream through entropy decoding.
  • the decoder parses the inter-frame motion features from the code stream through entropy decoding according to the specified first probability distribution parameter.
  • the decoder parses the first super prior feature from the code stream according to the specified second probability distribution parameter, inputs the first super prior feature into the super decoding network, and obtains the first prior feature .
  • the decoder determines the probability distribution parameter of the inter-frame motion feature based on the first a priori feature, and parses the inter-frame motion feature from the code stream based on the probability distribution parameter of the inter-frame motion feature.
  • the decoder parses the residual feature from the code stream through entropy decoding.
  • the decoder parses the residual feature from the code stream through entropy decoding according to the specified third probability distribution parameter.
  • the decoder parses the second super prior feature from the code stream according to the specified fourth probability distribution parameter, inputs the second super prior feature into the super decoding network, and obtains the second prior feature .
  • the decoding end determines the probability distribution parameter of the residual feature based on the second a priori feature, and parses the residual feature from the code stream through entropy decoding based on the probability distribution parameter of the residual feature.
  • the first probability distribution parameter, the second probability distribution parameter, the third probability distribution parameter and the fourth probability distribution parameter used in the decoding process are the same as those used in the encoding process.
  • the super decoding network used in the decoding process is the same as the super decoding network used in the encoding process.
  • Step 1602 Based on the inter-frame motion features and reference features, determine the prediction features of the current image to be decoded.
  • the decoder inputs the inter-frame motion features into the motion decoding network to obtain reconstructed motion features between the current image and the reference image. Based on the reconstructed motion features, the decoder transforms the reference features to obtain the prediction features of the current image.
  • the implementation process is consistent with the relevant content in the coding process and will not be repeated here.
  • Step 1603 Reconstruct the current image based on the prediction features and residual features.
  • the decoder obtains the reconstruction features of the current image based on the residual features and prediction features.
  • the decoder inputs the reconstruction features of the current image into the image reconstruction network to reconstruct the current image, that is, to obtain the reconstructed image of the current frame.
  • the decoder can input the residual feature into the residual decoding network to obtain the reconstructed residual. Based on the prediction feature and the reconstructed residual, the reconstructed feature of the current image is obtained.
  • the specific implementation process is consistent with the relevant content in the above-mentioned embodiment of FIG. 9 and will not be described again here.
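
The feature-space decoding path of steps 1601 to 1603 can be summarised in a few lines. In the sketch below every argument is a placeholder callable for the corresponding network or entropy-decoding routine described above, and the reconstruction feature is formed by adding the reconstructed residual to the prediction feature, which is an assumption; the document only says the two are combined.

    def decode_frame(bitstream, f_ref, parse_motion, parse_residual,
                     motion_decoder, residual_decoder, warp, image_reconstructor):
        """Feature-space decoding of Figure 16 (steps 1601-1603), as a sketch."""
        motion_feat = parse_motion(bitstream)        # step 1601: entropy-decode the motion features
        residual_feat = parse_residual(bitstream)    # step 1601: entropy-decode the residual features

        m_rec = motion_decoder(motion_feat)          # step 1602: reconstructed motion features
        f_pred = warp(f_ref, m_rec)                  # step 1602: prediction features of the current image

        r_rec = residual_decoder(residual_feat)      # step 1603: reconstructed residual
        f_rec = f_pred + r_rec                       # step 1603: reconstruction features (addition assumed)
        return image_reconstructor(f_rec)            # reconstructed image of the current frame
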
  • the above-mentioned steps 1602 and 1603 may be replaced by: determining the predicted image of the current image to be decoded based on the inter-frame motion features and the reference image, and reconstructing the current image based on the predicted image and the residual features.
  • the decoder can input the inter-frame motion features into the motion decoding network to obtain reconstructed motion features.
  • the decoder transforms the reference image based on the reconstructed motion features to obtain a predicted image of the current image.
  • the decoder can input the residual feature into the residual decoding network to obtain the reconstructed residual, and reconstruct the current image based on the predicted image and the reconstructed residual.
  • the decoder parses the inter-frame motion features and the residual from the code stream, and inputs the inter-frame motion features into the motion decoding network to obtain the reconstructed motion features between the current image and the reference image. Based on the reconstructed motion features, the decoder transforms the reference features to obtain the prediction features of the current image. The decoder obtains the reconstruction features of the current image based on the prediction features and the parsed residual, and inputs the reconstruction features of the current image into the image reconstruction network to reconstruct the current image.
  • in another embodiment, the decoder parses the inter-frame motion features and the residual from the code stream, and inputs the inter-frame motion features into the motion decoding network to obtain the reconstructed motion features between the current image and the reference image.
  • the decoder transforms the reference image based on the reconstructed motion features to obtain a predicted image of the current image.
  • the decoder reconstructs the current image based on the predicted image and the parsed residual.
  • for a P frame, the decoder parses one inter-frame motion feature and one residual feature from the code stream to decode the P frame.
  • for a B frame, the current image corresponds to two reference images.
  • in one implementation, the decoder parses two inter-frame motion features and two residual features from the code stream. The decoder obtains two prediction features using these two inter-frame motion features, obtains two reconstructed images of the current frame based on the two residual features and the two prediction features, and fuses the two reconstructed images to reconstruct the current image.
  • in another implementation, the decoder parses two inter-frame motion features and one residual feature from the code stream. The decoder obtains two prediction features based on these two inter-frame motion features, fuses the two prediction features to obtain a fused prediction feature, and reconstructs the current image based on the fused prediction feature and the residual feature.
  • the decoding process in any of the above embodiments matches the encoding process. For example, if the encoding process performs transformation and prediction in image space, then the decoding process also performs transformation and prediction in image space. If the encoding process is transformed and predicted in the feature space, then the decoding process is also transformed and predicted in the feature space.
  • a correlation matrix is introduced to fit inter-frame motion.
  • the correlation matrix has an information enhancement effect on the prediction of inter-frame motion features, that is, it can improve the prediction accuracy of inter-frame motion features, thereby improving compression performance.
  • Figure 17 is a schematic structural diagram of an encoding device 1700 provided by an embodiment of the present application.
  • the encoding device 1700 can be implemented as part or all of a computer device by software, hardware, or a combination of the two.
  • the computer device may include any of the encoding ends in the above embodiments.
  • the device 1700 includes: a first determination module 1701, a second determination module 1702, a third determination module 1703 and a first encoding module 1704.
  • the first determination module 1701 is used to determine the current feature and the reference feature.
  • the current feature is the feature of the current image to be encoded, and the reference feature is the feature of the reference image of the current image;
  • the second determination module 1702 is used to determine the correlation matrix of the reference feature relative to the current feature
  • the third determination module 1703 is used to determine inter-frame motion characteristics based on the correlation matrix
  • the first encoding module 1704 is used to encode the inter-frame motion features into the code stream.
  • the third determination module 1703 is used for:
  • the correlation matrix, the current image and the reference image are input into the motion coding network to obtain inter-frame motion features.
  • the third determination module 1703 is used for:
  • take the reference feature as the prediction feature, and input the correlation matrix, the prediction feature and the current feature into the motion coding network to obtain a motion feature;
  • determine the number of iterations;
  • if the number of iterations is less than the iteration number threshold, input the motion feature into the motion decoding network to obtain a reconstructed motion feature, transform the reference feature based on the reconstructed motion feature to re-determine the prediction feature, re-determine the correlation matrix of the prediction feature relative to the current feature, and return to the step of inputting the correlation matrix, the prediction feature and the current feature into the motion coding network to obtain the motion feature;
  • if the number of iterations equals the iteration number threshold, determine the motion feature as the inter-frame motion feature.
  • the device 1700 also includes:
  • the fourth determination module is used to determine residual features based on the inter-frame motion features
  • the second encoding module is used to encode the residual features into the code stream.
  • the fourth determination module is used for:
  • the reference features are transformed to obtain the prediction features of the current image
  • the first residual is input into the residual coding network to obtain the residual feature.
  • the fourth determination module is used for:
  • the reference image is transformed to obtain the predicted image
  • the second residual is input into the residual coding network to obtain the residual feature.
  • the reference image is a reconstructed image of the reference frame.
  • inter-frame motion is fitted by introducing a correlation matrix. Since the correlation matrix can represent the parts with stronger and weaker correlation between the current feature and the reference feature, and the parts with stronger correlation correspond to richer inter-frame motion information, in the process of fitting inter-frame motion, based on the size of each element in the correlation matrix, the inter-frame motion corresponding to the strongly correlated parts can be fitted better, while less attention is paid to the inter-frame motion corresponding to the weakly correlated parts.
  • the correlation matrix has an information enhancement effect on the prediction of inter-frame motion features, that is, it can improve the prediction accuracy of inter-frame motion features, thereby improving compression performance.
  • when the encoding device provided in the above embodiment performs video encoding, the division into the above functional modules is only used as an example for illustration.
  • in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the encoding device provided by the above embodiments and the encoding method embodiments belong to the same concept. Please refer to the method embodiments for the specific implementation process, which will not be described again here.
  • Figure 18 is a schematic block diagram of a coding and decoding device 1800 used in an embodiment of the present application.
  • the encoding and decoding device 1800 may include a processor 1801, a memory 1802, and a bus system 1803.
  • the processor 1801 and the memory 1802 are connected through a bus system 1803.
  • the memory 1802 is used to store instructions, and the processor 1801 is used to execute the instructions stored in the memory 1802 to perform the various encoding or decoding methods described in the embodiments of this application. To avoid repetition, no detailed description is given here.
  • the processor 1801 may be a central processing unit (CPU).
  • the processor 1801 may also be another general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the memory 1802 may include a ROM device or a RAM device. Any other suitable type of storage device may also be used as memory 1802.
  • Memory 1802 may include code and data 18021 accessed by processor 1801 using bus 1803 .
  • the memory 1802 may further include an operating system 18023 and an application program 18022, which includes at least one program that allows the processor 1801 to perform the encoding or decoding method described in the embodiment of the present application.
  • the application program 18022 may include applications 1 to N, which further include encoding or decoding applications (referred to as encoding and decoding applications) that perform the encoding or decoding methods described in the embodiments of this application.
  • in addition to a data bus, the bus system 1803 may also include a power bus, a control bus, a status signal bus, etc.
  • various buses are labeled as bus system 1803 in the figure.
  • the codec apparatus 1800 may also include one or more output devices, such as a display 1804.
  • display 1804 may be a tactile display that incorporates a display with a tactile unit operable to sense touch input.
  • Display 1804 may be connected to processor 1801 via bus 1803.
  • the encoding and decoding device 1800 can perform the encoding method in the embodiment of the present application, and can also perform the decoding method in the embodiment of the present application.
  • Computer-readable media may include computer-readable storage media that correspond to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (e.g., based on a communications protocol) .
  • computer-readable media generally may correspond to (1) non-transitory tangible computer-readable storage media, or (2) communication media, such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this application.
  • a computer program product may include computer-readable media.
  • such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • any connection is properly termed a computer-readable medium.
  • for example, if coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave are used to transmit instructions from a website, server, or other remote source, then coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media.
  • disks and optical discs include compact discs (CDs), laser discs, optical discs, DVDs, and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • processors may refer to any of the foregoing structures or any other structure suitable for implementing the techniques described herein.
  • the functionality described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec.
  • the techniques may be entirely implemented in one or more circuits or logic elements.
  • various illustrative logical blocks, units, and modules in the encoder 100 and the decoder 200 can be understood as corresponding circuit devices or logical elements.
  • the techniques of the embodiments of the present application may be implemented in a wide variety of devices or apparatuses, including wireless handsets, integrated circuits (ICs), or a set of ICs (e.g., a chipset).
  • Various components, modules or units are described in the embodiments of this application to emphasize the functional aspects of the apparatus for performing the disclosed technology, but they do not necessarily need to be implemented by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or provided by interoperating hardware units (including one or more processors as described above).
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, radio, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks, tapes), optical media (such as digital versatile discs (DVDs)) or semiconductor media (such as solid state drives (SSDs)), etc.
  • the computer-readable storage media mentioned in the embodiments of this application may be non-volatile storage media, in other words, may be non-transitory storage media.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the videos, images, etc. involved in the embodiments of this application are all obtained with full authorization.


Abstract

This application discloses an encoding method, apparatus, storage medium and computer program product, belonging to the field of data compression technology. The method fits inter-frame motion by introducing a correlation matrix. Since the correlation matrix can characterize the parts of stronger and weaker correlation between the current feature and the reference feature, and the parts with stronger correlation correspond to richer inter-frame motion information, in the process of fitting the inter-frame motion, based on the magnitude of each element in the correlation matrix, the inter-frame motion corresponding to the strongly correlated parts can be fitted better, while less attention is paid to the inter-frame motion corresponding to the weakly correlated parts. In short, the correlation matrix has an information-enhancing effect on the prediction of inter-frame motion features, that is, it can improve the prediction accuracy of inter-frame motion features and thus improve compression performance.

Description

编码方法、装置、存储介质及计算机程序产品
本申请要求于2022年03月31日提交的申请号为202210345172.7、发明名称为“编码方法、装置、存储介质及计算机程序产品”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据压缩技术领域,特别涉及一种编码方法、装置、存储介质及计算机程序产品。
背景技术
视频编码能够减轻视频存储、视频传输占用网络带宽的压力。视频编码也称为视频压缩,视频编码的本质在于去除视频中的冗余信息,以达到用较少的数据(视频码流)表示原视频的目的。视频编码包括帧内预测编码和帧间预测编码,帧内预测编码不需要利用参考帧,帧间预测编码需要利用当前帧与参考帧来确定帧间运动信息,利用帧间运动信息来压缩视频。而视频编码的关键在于如何更有效地利用帧间运动信息。因此,目前视频编码的相关研究越来越多地聚焦于帧间预测编码。然而,当前一些帧间预测编码方案中帧间运动信息的预测精度较低。
发明内容
本申请实施例提供了一种编码方法、装置、存储介质及计算机程序产品,能够提高帧间运动信息的预测精度,进而提升压缩性能。所述技术方案如下:
第一方面,提供了一种编码方法,该方法包括:
确定当前特征和参考特征,当前特征为待编码的当前图像的特征,参考特征为当前图像的参考图像的特征;确定参考特征相对于当前特征的相关性矩阵;基于该相关性矩阵,确定帧间运动特征;将该帧间运动特征编入码流。
本方案通过引入相关性矩阵来拟合帧间运动,由于相关性矩阵能够表征当前特征与参考特征之间相关性较强和较弱的部分,相关性较强的部分所对应的帧间运动信息更加丰富,因此,在拟合帧间运动的过程中,基于相关性矩阵中各个元素的大小,便能够更好地拟合出相关性较强的部分对应的帧间运动,而较少地关注相关性较弱的部分对应的帧间运动。简单来说,相关性矩阵对帧间运动特征的预测有信息增强的作用,能够提高帧间运动特征的预测精度,进而提升压缩性能。
可选地,基于该相关性矩阵,确定帧间运动特征,包括:将该相关性矩阵输入运动编码网络,以得到帧间运动特征;或者,将该相关性矩阵、当前特征和参考特征输入运动编码网络,以得到帧间运动特征;或者,将该相关性矩阵、当前图像和参考图像输入运动编码网络,以得到帧间运动特征。
应当理解的是,在本申请实施方式中,编码端既可以直接将相关性矩阵输入运动编码网 络,得到帧间运动特征,也可以利用特征空间的参考特征和当前特征,并结合相关性矩阵得到帧间运动特征,也可以利用图像空间的参考图像和当前图像,并结合相关性矩阵得到帧间运动特征。
可选地,基于该相关性矩阵,确定帧间运动特征,包括:将参考特征作为预测特征,将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征;确定迭代次数;如果迭代次数小于迭代次数阈值,则将该运动特征输入运动解码网络,以得到重建运动特征,基于该重建运动特征,对参考特征进行变换,以重新确定预测特征,重新确定预测特征相对于当前特征的相关性矩阵,返回执行将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征的步骤;如果迭代次数等于迭代次数阈值,则将该运动特征确定为帧间运动特征。
应当理解的是,在这种迭代处理的实施方式中,编码端通过多次迭代的方式来提高帧间运动特征的预测精度,换句话说,通过对运动特征的迭代更新,更加丰富了运动特征的细节。
可选地,基于该相关性矩阵,确定帧间运动特征之后,还包括:基于该帧间运动特征,确定残差特征;将该残差特征编入码流。应当理解的是,对于帧间预测编码来说,编码端除了确定并编码帧间运动特征之外,还确定残差特征并编码残差特征,以便于解码端基于帧间运动特征和残差特征来解压视频。
可选地,基于该帧间运动特征,确定残差特征,包括:将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征;基于当前图像与参考图像之间的重建运动特征,对参考特征进行变换,以得到当前图像的预测特征;确定第一残差,第一残差为当前图像的预测特征与当前特征之间的残差;将第一残差输入残差编码网络,以得到该残差特征。应当理解的是,在这种实施方式中,编码端在特征空间进行变换和预测。
可选地,基于该帧间运动特征,确定残差特征,包括:将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征;基于当前图像与参考图像之间的重建运动特征,对参考图像进行变换,以得到预测图像;确定第二残差,第二残差为预测图像与当前图像之间的残差;将第二残差输入残差编码网络,以得到残差特征。应当理解的是,在这种实施方式中,编码端在图像空间进行变换和预测。
可选地,该参考图像为参考帧的重建图像。
第二方面,提供了一种编码装置,所述编码装置具有实现上述第一方面中编码方法行为的功能。所述编码装置包括一个或多个模块,该一个或多个模块用于实现上述第一方面所提供的编码方法。
也即是,提供了一种编码装置,该装置包括:
第一确定模块,用于确定当前特征和参考特征,当前特征为待编码的当前图像的特征,参考特征为当前图像的参考图像的特征;
第二确定模块,用于确定参考特征相对于当前特征的相关性矩阵;
第三确定模块,用于基于该相关性矩阵,确定帧间运动特征;
第一编码模块,用于将该帧间运动特征编入码流。
可选地,第三确定模块用于:
将该相关性矩阵输入运动编码网络,以得到帧间运动特征;或者,
将该相关性矩阵、当前特征和参考特征输入运动编码网络,以得到帧间运动特征;或者,
将该相关性矩阵、当前图像和参考图像输入运动编码网络,以得到帧间运动特征。
可选地,第三确定模块用于:
将参考特征作为预测特征,将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征;
确定迭代次数;
如果迭代次数小于迭代次数阈值,则将该运动特征输入运动解码网络,以得到重建运动特征,基于该重建运动特征,对参考特征进行变换,以重新确定预测特征,重新确定预测特征相对于当前特征的相关性矩阵,返回执行将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征的步骤;
如果迭代次数等于迭代次数阈值,则将该运动特征确定为帧间运动特征。
可选地,该装置还包括:
第四确定模块,用于基于该帧间运动特征,确定残差特征;
第二编码模块,用于将该残差特征编入码流。
可选地,第四确定模块用于:
将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征;
基于当前图像与参考图像之间的重建运动特征,对参考特征进行变换,以得到当前图像的预测特征;
确定第一残差,第一残差为当前图像的预测特征与当前特征之间的残差;
将第一残差输入残差编码网络,以得到该残差特征。
可选地,第四确定模块用于:
将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征;
基于当前图像与参考图像之间的重建运动特征,对参考图像进行变换,以得到预测图像;
确定第二残差,第二残差为预测图像与当前图像之间的残差;
将第二残差输入残差编码网络,以得到该残差特征。
可选地,该参考图像为参考帧的重建图像。
第三方面,提供了一种编码装置,所述编码装置包括处理器和接口电路。所述处理器通过所述接口电路接收和/或发送数据,所述处理器被配置为用于调用存储在存储器中的程序指令,以执行上述第一方面所提供的编码方法。
可选地,所述编码装置包括所述存储器。在本申请实施过程中,所述处理器用于确定当前特征和参考特征,所述当前特征为待编码的当前图像的特征,所述参考特征为所述当前图像的参考图像的特征,还用于确定所述参考特征相对于所述当前特征的相关性矩阵,基于所述相关性矩阵,确定帧间运动特征,以及将所述帧间运动特征编入码流。
第四方面,提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器用于存储执行上述第一方面所提供的编码方法的程序,以及存储用于实现上述第一方面所提供的编码方法所涉及的数据。所述处理器被配置为用于执行所述存储器中存储的程序。所述计算机设备还可以包括通信总线,该通信总线用于该处理器与存储器之间建立连接。
第五方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面所述的编码方法。
第六方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的编码方法。
上述第二方面、第三方面、第四方面、第五方面和第六方面所获得的技术效果与第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
附图说明
图1是本申请实施例提供的一种实施环境的示意图;
图2是本申请实施例提供的另一种实施环境的示意图;
图3是本申请实施例提供的一种编码方法的流程图;
图4是本申请实施例提供的一种重建运动特征的对比图;
图5是本申请实施例提供的一种编码网络的结构示意图;
图6是本申请实施例提供的一种解码网络的结构示意图;
图7是本申请实施例提供的另一种编码方法的流程图;
图8是本申请实施例提供的一种熵估计网络的结构示意图;
图9是本申请实施例提供的一种编解码方法的流程图;
图10是本申请实施例提供的一种编码方法中的部分流程图;
图11是本申请实施例提供的另一种重建运动特征的对比图;
图12是本申请实施例提供的一个编解码性能对比图;
图13是本申请实施例提供的另一个编解码性能对比图;
图14是本申请实施例提供的又一个编解码性能对比图;
图15是本申请实施例提供的又一个编解码性能对比图;
图16是本申请实施例提供的一种解码方法的流程图;
图17是本申请实施例提供的一种编码装置的结构示意图;
图18是本申请实施例提供的一种编解码装置的示意性框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请实施例描述的系统架构以及业务场景是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域普通技术人员可知,随着系统架构的演变和新业务场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
在对本申请实施例提供的编解码方法进行详细地解释说明之前,先对本申请实施例涉及的术语和实施环境进行介绍。
为了便于理解,首先对本申请实施例涉及的部分术语和相关技术进行解释。
像素深度(bits per pixel,BPP):又称为位/像素,BPP是存储每个像素所用的位数,BPP越小代表压缩码率越小。
码率:在图像压缩中,指单位像素编码所需要的编码长度,码率越高,图像重建质量越好。
峰值信噪比(peak signal to noise ratio,PSNR):是一种评价图像质量的客观标准,PSNR越高代表图像质量越好。
多尺度结构相似性(multi-scale structural similarity index measure,MS-SSIM):是一种评价图像的客观标准,MS-SSIM越高代表图像质量越好。
人工智能(artificial intelligence,AI):是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术以及应用系统的一门技术科学。
卷积神经网络(convolution neural network,CNN):是一种包含卷积计算且具有深度结构的前馈神经网络,是深度学习的代表算法之一。
图像组(group of pictures,GOP):视频的码流包括多个GOP。GOP是一组连续的画面,由I帧、P帧和/或B帧组成,是视频图像编码器和解码器存取的基本单位。
I帧:内部编码(intra-coded)帧,也称为关键帧。I帧不需要参考其他画面而压缩生成,I帧描述了图像背景和运动主体的详情,解码时仅用I帧的数据就可重构完整图像。I帧通常是每个GOP的第一个帧,作为随机访问的参考帧。
P帧:前向预测编码(predictive-coded)帧,也称为前向预测帧(前向参考帧)。P帧表示的是这一帧跟之前的一个关键帧(或P帧)的差别,P帧采用运动补偿的方法传送它与前面的I或P帧之间的预测误差及运动矢量,解码时需要用之前缓存的画面叠加上本帧表示的差别,生成完整图像。
B帧:双向预测(bidirectionally predicted)帧,也称为双向内插帧、双向参考帧。B帧以前面的一个I帧或P帧,以及后面的P帧为参考帧,B帧传送的是它与前后两个参考帧之间的预测误差及运动矢量。解码时根据运动矢量和预测误差,结合前后两个参考帧,从而得到完整图像。
I帧采用的是帧内预测编码,P帧和B帧采用的是帧间预测编码,P帧和B帧相对I帧来说具有更高的压缩率。帧间预测编码主要包括两部分。一部分为帧间预测部分,另一部分为残差压缩部分。帧间预测部分包括帧间边信息的预测和压缩模块,以及变换模块。在一相关技术中帧间边信息体现为光流,在编码过程中,将参考帧和当前帧的图像输入光流估计网络,以得到预测的光流,并压缩光流。在另一相关技术中帧间边信息体现为运动特征,在编码过程中,提取当前帧和参考帧的图像特征,将当前帧和参考帧的图像特征输入卷积神经网络,以得到预测的运动特征,并压缩运动特征。变换模块通常采用wrap操作,在编码过程中,利用帧间边信息,将参考帧变换为当前帧的预测结果。
然而,相关技术中光流的预测和压缩完全解耦,会存在预测得到的光流可能能够较好地表征当前帧与参考帧之间的帧间变化,但是该光流不一定好压缩的问题,从而影响编解码性能。另外,光流估计网络的算力要求会较大,即预测光流的计算量较大。另一相关技术中将当前帧和参考帧的图像特征输入卷积神经网络,从而完全依赖卷积运算来拟合当前帧与参考帧之间的运动,所得到的运动特征的精确度较低,想要预测出更精确的运动特征的难度较大, 从而影响编解码性能。
相关性矩阵:在本申请实施例中通过确定参考特征相对于当前特征的相关性矩阵,从而具有相关性矩阵来预测更精准的帧间运动特征。相关性矩阵也称为互相关矩阵、邻域互相关矩阵、邻域相关性矩阵等。
确定一个特征相对于另一个特征的相关性矩阵的一种计算方式包括:给定两个特征F1和F2,以及邻域大小,计算特征F2相对于特征F1的邻域相关性矩阵。该邻域大小为k*k,特征F1和F2的维度均为c*h*w,其中,c、h、w分别为特征空间的通道数、高和宽,h*w表征特征空间的大小。计算邻域相关性矩阵的操作包括:将特征F1中点(i,j)的特征向量记为其中,i∈[1,h],j∈[1,w]。计算在特征F2中以点(i,j)为中心,k*k大小的邻域范围内所有点的特征向量的相关性值最终对于特征F1中的每个点得到一个k*k大小的相关性值的集合,从而形成一个维度为k*k*h*w的矩阵,即特征F2相对于特征F1的相关性矩阵。其中,corr()函数可以是任意形式的距离函数,例如内积、余弦(cos)、L1距离、L2距离等函数,又如由卷积学习得到的距离函数等。
接下来对本申请实施例涉及的实施环境进行介绍。
请参考图1,图1是本申请实施例提供的一种实施环境的示意图。该实施环境包括源装置10、目的地装置20、链路30和存储装置40。其中,源装置10可以产生经编码的视频,即码流。因此,源装置10也可以被称为编码装置。目的地装置20可以对由源装置10所产生码流进行解码。因此,目的地装置20也可以被称为解码装置。链路30可以接收源装置10所产生的经编码的视频,并可以将该经编码的视频传输给目的地装置20。存储装置40可以接收源装置10所产生的经编码的视频,并可以将该经编码的视频进行存储,这样的条件下,目的地装置20可以直接从存储装置40中获取经编码的视频。或者,存储装置40可以对应于文件服务器或可以保存由源装置10产生的经编码的视频的另一中间存储装置,这样的条件下,目的地装置20可以经由流式传输或下载存储装置40存储的经编码的视频。
源装置10和目的地装置20均可以包括一个或多个处理器以及耦合到该一个或多个处理器的存储器,该存储器可以包括随机存取存储器(random access memory,RAM)、只读存储器(read-only memory,ROM)、带电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、快闪存储器、可用于以可由计算机存取的指令或数据结构的形式存储所要的程序代码的任何其它媒体等。例如,源装置10和目的地装置20均可以包括手机、智能手机、个人数字助手(personal digital assistant,PDA)、可穿戴设备、掌上电脑(pocket PC,PPC)、平板电脑、智能车机、智能电视、智能音箱、桌上型计算机、移动计算装置、笔记型(例如,膝上型)计算机、平板计算机、机顶盒、例如所谓的“智能”电话等电话手持机、电视机、相机、显示装置、数字媒体播放器、视频游戏控制台、车载计算机或其类似者。
链路30可以包括能够将经编码的视频从源装置10传输到目的地装置20的一个或多个媒体或装置。在一种可能的实现方式中,链路30可以包括能够使源装置10实时地将经编码的视频直接发送到目的地装置20的一个或多个通信媒体。在本申请实施例中,源装置10可以基于通信标准来调制经编码的视频,该通信标准可以为无线通信协议等,并且可以将经调制的视频发送给目的地装置20。该一个或多个通信媒体可以包括无线和/或有线通信媒体,例如该一个或多个通信媒体可以包括射频(radio frequency,RF)频谱或一个或多个物理传输线。该一个或多个通信媒体可以形成基于分组的网络的一部分,基于分组的网络可以为局域网、 广域网或全球网络(例如,因特网)等。该一个或多个通信媒体可以包括路由器、交换器、基站或促进从源装置10到目的地装置20的通信的其它设备等,本申请实施例对此不做具体限定。
在一种可能的实现方式中,存储装置40可以将接收到的由源装置10发送的经编码的视频进行存储,目的地装置20可以直接从存储装置40中获取经编码的视频。这样的条件下,存储装置40可以包括多种分布式或本地存取的数据存储媒体中的任一者,例如,该多种分布式或本地存取的数据存储媒体中的任一者可以为硬盘驱动器、蓝光光盘、数字多功能光盘(digital versatile disc,DVD)、只读光盘(compact disc read-only memory,CD-ROM)、快闪存储器、易失性或非易失性存储器,或用于存储码流的任何其它合适的数字存储媒体等。
在一种可能的实现方式中,存储装置40可以对应于文件服务器或可以保存由源装置10产生的码流的另一中间存储装置,目的地装置20可经由流式传输或下载存储装置40存储的图像。文件服务器可以为能够存储经编码的视频并且将经编码的视频发送给目的地装置20的任意类型的服务器。在一种可能的实现方式中,文件服务器可以包括网络服务器、文件传输协议(file transfer protocol,FTP)服务器、网络附属存储(network attached storage,NAS)装置或本地磁盘驱动器等。目的地装置20可以通过任意标准数据连接(包括因特网连接)来获取经编码图像。任意标准数据连接可以包括无线信道(例如,Wi-Fi连接)、有线连接(例如,数字用户线路(digital subscriber line,DSL)、电缆调制解调器等),或适合于获取存储在文件服务器上的经编码的视频的两者的组合。经编码的视频从存储装置40的传输可为流式传输、下载传输或两者的组合。
图1所示的实施环境仅为一种可能的实现方式,并且本申请实施例的技术不仅可以适用于图1所示的可以对图像进行编码的源装置10,以及可以对经编码的视频进行解码的目的地装置20,还可以适用于其他可以对视频进行编码和对码流进行解码的装置,本申请实施例对此不做具体限定。
在图1所示的实施环境中,源装置10包括数据源120、编码器100和输出接口140。在一些实施例中,输出接口140可以包括调节器/解调器(调制解调器)和/或发送器,其中发送器也可以称为发射器。数据源120可以包括视频捕获装置(例如,摄像机等)、含有先前捕获的视频的存档、用于从视频内容提供者接收视频的馈入接口,和/或用于产生视频的计算机图形系统,或视频的这些来源的组合。
数据源120可以向编码器100发送视频,编码器100可以对接收到由数据源120发送的视频进行编码,得到经编码的视频。编码器可以将经编码的视频发送给输出接口。在一些实施例中,源装置10经由输出接口140将经编码的视频直接发送到目的地装置20。在其它实施例中,经编码的视频还可存储到存储装置40上,供目的地装置20以后获取并用于解码和/或显示。
在图1所示的实施环境中,目的地装置20包括输入接口240、解码器200和显示装置220。在一些实施例中,输入接口240包括接收器和/或调制解调器。输入接口240可经由链路30和/或从存储装置40接收经编码的视频,然后再发送给解码器200,解码器200可以对接收到的经编码的视频进行解码,得到经解码的视频。解码器可以将经解码的视频发送给显示装置220。显示装置220可与目的地装置20集成或可在目的地装置20外部。一般来说,显示装置220显示经解码的视频。显示装置220可以为多种类型中的任一种类型的显示装置,例如, 显示装置220可以为液晶显示器(liquid crystal display,LCD)、等离子显示器、有机发光二极管(organic light-emitting diode,OLED)显示器或其它类型的显示装置。
尽管图1中未示出,但在一些方面,编码器100和解码器200可各自与编码器和解码器集成,且可以包括适当的多路复用器-多路分用器(multiplexer-demultiplexer,MUX-DEMUX)单元或其它硬件和软件,用于共同数据流或单独数据流中的音频和视频两者的编码。在一些实施例中,如果适用的话,那么MUX-DEMUX单元可符合ITU H.223多路复用器协议,或例如用户数据报协议(user datagram protocol,UDP)等其它协议。
编码器100和解码器200各自可为以下各项电路中的任一者:一个或多个微处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、离散逻辑、硬件或其任何组合。如果部分地以软件来实施本申请实施例的技术,那么装置可将用于软件的指令存储在合适的非易失性计算机可读存储媒体中,且可使用一个或多个处理器在硬件中执行所述指令从而实施本申请实施例的技术。前述内容(包括硬件、软件、硬件与软件的组合等)中的任一者可被视为一个或多个处理器。编码器100和解码器200中的每一者都可以包括在一个或多个编码器或解码器中,所述编码器或所述解码器中的任一者可以集成为相应装置中的组合编码器/解码器(编码解码器)的一部分。
本申请实施例可大体上将编码器100称为将某些信息“发信号通知”或“发送”到例如解码器200的另一装置。术语“发信号通知”或“发送”可大体上指代用于对经压缩的视频进行解码的语法元素和/或其它数据的传送。此传送可实时或几乎实时地发生。替代地,此通信可经过一段时间后发生,例如可在编码时在经编码位流中将语法元素存储到计算机可读存储媒体时发生,解码装置接着可在将语法元素存储到此媒体之后的任何时间检索该语法元素。
图2是本申请实施例提供的另一种实施环境的示意图。该实施环境包括编码端和解码端,编码端包括AI编码模块、熵编码模块和文件发送模块,解码端包括文件加载模块、熵解码模块和AI解码模块。
在压缩过程中,编码端获取待压缩的视频后,经AI编码单元得到待编码的帧间运动特征和残差特征,将帧间运动特征和残差特征进行熵编码,以得到码流,即视频的压缩文件。编码端保存压缩文件。另外,压缩文件传输到解码端,解码端加载压缩文件,并通过熵解码和AI解码单元得到解压后的视频。
可选地,AI编码单元包括下文中的图像特征提取网络、运动编码网络、残差编码网络、熵估计网络、运动解码网络或残差解码网络等中的一种或多种,AI解码单元包括下文中的运动解码网络、残差解码网络或熵估计网络等中的一种或多种。
可选地,AI编码单元和AI解码单元处理数据的过程在嵌入式的神经网络处理器(neural network processing unit,NPU)上实现,以提高数据处理效率,熵编码、保存文件以及加载等过程在中央处理器(central processing unit,CPU)上实现。
可选地,编码端和解码端为一台设备,或者,编码端和解码端为两台独立的设备。也即是,对于一台设备来说,该设备既具备视频压缩功能,也具备视频解压功能,或者,该设备具备视频压缩功能或视频解压功能。
需要说明的是,本申请实施例提供的编解码方法可以应用于多种场景,比如云存储、视频监控、直播、传输等业务场景中,具体可以应用到终端录像、视频相册、云存储等中。结合图1和图2所示的实施环境,下文中的任一种编码方法可以是编码端执行的。下文中的任一种解码方法可以是解码端执行的。
图3是本申请实施例提供的一种编码方法的流程图,该方法应用于编码端。请参考图3,该方法包括如下步骤。
步骤301:确定当前特征和参考特征,当前特征为待编码的当前图像的特征,参考特征为当前图像的参考图像的特征。
在本申请实施例中,对于帧间预测编码来说,待编码的当前图像对应有参考图像。例如,P帧的图像对应有一个参考图像,该参考图像为该P帧之前的一个I帧或P帧的图像。又如,B帧的图像对应有两个参考图像,即该B帧之前的一个I帧或P帧的图像,以及该B帧之后的一个P帧的图像。接下来以P帧为例进行介绍。
在进行帧间预测编码的过程中,编码端确定当前特征和参考特征的一种实现方式为:将当前图像输入图像特征提取网络,以得到当前特征,将参考图像输入图像特征提取网络,以得到参考特征。其中,当前特征为待编码的当前图像的特征,参考特征为当前图像的参考图像的特征。除了通过图像特征提取网络来提取图像的特征之外,编码端也可以通过其他实现方式来提取图像的特征,例如主成分分析、基于统计的方法等。
需要说明的是,本申请实施例中的图像特征提取网络是预先训练得到的,本申请实施例中不限定图像特征提取网络的网络结构和训练方式等。例如,图像特征提取网络可以基于全连接网络或卷积神经网络所构建的网络,卷积神经网络中的卷积可以是2D卷积或3D卷积。另外,本申请实施例对图像特征提取网络所包括的网络层数和每一层的节点数也不作限定。在一具体实现中,图像特征提取网络为基于Resblock所构建的网络。
可选地,参考图像为参考帧的重建图像。其中,该参考帧为待编码的当前帧的参考帧,待编码的当前帧的原始图像即当前图像。参考帧的重建图像为根据本申请实施例提供的编码方法对参考帧的原始图像进行压缩后再解压得到的图像。在另一些实施例中,参考图像为参考帧的原始图像。
步骤302:确定参考特征相对于当前特征的相关性矩阵。
为了提高对帧间运动特征的预测精度,在方案引入相关性矩阵。在本申请实施例中,编码端确定参考特征相对于当前特征的相关性矩阵。相关性矩阵的计算方式请参见上文介绍。
步骤303:基于该相关性矩阵,确定帧间运动特征。
需要说明的是,编码端基于参考特征相对于当前特征的相关性矩阵来确定帧间运动特征的实现方式有多种,接下来将对其中的几种实现方式进行详细介绍。
在第一种实现方式中,编码端将该相关性矩阵、当前特征和参考特征输入运动编码网络,以得到帧间运动特征。在第二种实现方式中,编码端将该相关性矩阵、当前图像和参考图像输入运动编码网络,以得到帧间运动特征。在第三种实现方式中,编码端将该相关性矩阵输入运动编码网络,以得到帧间运动特征。
在第四种实现方式中,编码端将参考特征作为预测特征,将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征。编码端确定迭代次数,如果该迭代次数小于迭代次数阈值,则编码端将该运动特征输入运动解码网络,以得到重建运动特征,基于该重 建运动特征,对参考特征进行变换,以重新确定预测特征,重新确定该预测特征相对于当前特征的相关性矩阵,返回执行将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征的步骤。如果该迭代次数等于迭代次数阈值,则编码端将该运动特征确定为帧间运动特征。
其中,第一次迭代处理的过程中,迭代次数等于初始值,最后一次迭代处理的过程中,迭代次数等于迭代次数阈值。可选地,初始值为0,且迭代次数阈值为K-1,或者,初始值为1,且迭代次数阈值为K。其中,K为大于或等于1的正整数,K表示迭代处理的总次数。
示例性地,假设迭代处理的总次数为K,初始值为1,迭代次数阈值为K,待编码的当前图像xt为视频中的第t帧,t>0,当前图像xt的参考图像为参考帧的重建图像当前特征为Ft,参考特征为参考特征相对于当前特征Ft的相关性矩阵为Ct。编码过程中迭代处理的过程如下:
在第i次迭代处理的过程中,将相关性矩阵预测特征和当前特征Ft输入运动编码网络,以得到运动特征其中,当i=1时,预测特征等于参考特征相关性矩阵等于相关性矩阵Ct。判断i是否小于K。如果i小于K,则将运动特征输入运动解码网络,以得到重建运动特征基于重建运动特征对参考特征进行变换,以得到预测特征确定预测特征相对于当前特征Ft的相关性矩阵然后,执行第i+1次迭代处理。如果i=K,则将运动特征作为帧间运动特征。结束迭代处理的流程。
在第五种实现方式中,编码端将参考图像作为预测图像,将该相关性矩阵、预测图像和当前图像输入运动编码网络,以得到运动特征。编码端确定迭代次数,如果该迭代次数小于迭代次数阈值,则编码端将该运动特征输入运动解码网络,以得到重建运动特征,基于该重建运动特征,对参考图像进行变换,以重新确定预测图像,确定预测特征,即预测图像的特征,重新确定该预测特征相对于当前特征的相关性矩阵,返回执行将该相关性矩阵、预测图像和当前图像输入运动编码网络,以得到运动特征的步骤。如果该迭代次数等于迭代次数阈值,则编码端将该运动特征确定为帧间运动特征。
由上述可知,在上述第四和第五种实现方式中,编码端通过多次迭代的方式来提高帧间运动特征的预测精度,换句话说,通过对运动特征的迭代更新,更加丰富了运动细节。而在上述第一至第三种实现方式中,编码端相当于通过一次迭代来确定帧间运动特征,这样能够节省编解码的时间。
需要说明的是,帧间运动特征的预测精度越高,重建运动特征所表征的运动细节也即越丰富。在本申请实施例中,重建运动特征也称为重建的运动信息。图4是本申请实施例提供的一种重建运动特征的对比图。图4是关于某视频中同一图像的重建运动特征,第一列是第一次迭代处理后得到的重建运动特征,第二列是第二次迭代处理后得到的重建运动特征。在图4中用椭圆圈出了一部分对比明显的区域,可以看出,第二列的边缘结构更加清晰明确、精度更高,换句话说,第二列的重建运动特征相比于第一列的重建运动特征具有更明显的细节信息。
另外,本申请实施例中的运动编码网络和运动解码网络是预先训练得到的,本申请实施例中不限定运动编码网络和运动解码网络的网络结构和训练方式等。例如,运动编码网络和运动解码网络均可以是全连接网络或卷积神经网络,卷积神经网络中的卷积可以是2D卷积或3D卷积。另外,本申请实施例对运动编码网络和运动解码网络所包括的网络层数和每一 层的节点数也不作限定。
图5是本申请实施例提供的一种编码网络的结构示意图。该编码网络可以为运动编码网络。参见图5,该编码网络为卷积神经网络,该卷积神经网络包括四个卷积层(Conv)和穿插级联的三个抓取检测网络(grasp detection network,GDN)层。每个卷积层的卷积核大小均为5×5,输出的特征图的通道数为M,每个卷积层对宽和高进行2倍下采样。需要说明的是,图5所示编码网络的结构并不用于限制本申请实施例,例如,卷积核大小、特征图的通道数、下采样倍数、下采样次数、卷积层数等均可调整。
图6是本申请实施例提供的一种解码网络的结构示意图。该解码网络可以为运动解码网络。参见图6,该解码网络为卷积神经网络,该卷积神经网络包括四个卷积层(Conv)和穿插级联的三个抓取检测网络(GDN)层。每个卷积层的卷积核大小均为5×5,输出的特征图的通道数为M或N,每个卷积层对宽和高进行2倍上采样。需要说明的是,图6所示解码网络的结构并不用于限制本申请实施例,例如,卷积核大小、特征图的通道数、下采样倍数、下采样次数、卷积层数等均可调整。
步骤304:将该帧间运动特征编入码流。
在本申请实施例中,编码端将该帧间运动特征编入码流,以便于后续解码端基于码流中的帧间运动特征来解压视频。
可选地,编码端通过熵编码将该帧间运动特征编入码流。在一种实现方式中,编码端根据指定的第一概率分布参数,通过熵编码将该帧间运动特征编入码流。在另一种实现方式中,编码端将该帧间运动特征输入超编码网络(也可称为超先验网络),以得到第一超先验特征。编码端根据指定的第二概率分布参数,通过熵编码将第一超先验特征编入码流。另外,编码端将第一超先验特征(从码流中解析出的第一超先验特征或者通过超编码网络得到的第一超先验特征)输入超解码网络,得到第一先验特征。编码端基于第一先验特征确定该帧间运动特征的概率分布参数,基于该帧间运动特征的概率分布参数,通过熵编码将该帧间运动特征编入码流。需要说明的是,编码端将第一超先验特征编入码流是为了解码端基于第一超先验特征从码流中解析出帧间运动特征。
其中,指定的第一概率分布参数和第二概率分布参数均为预先通过相应的概率分布估计网络确定的概率分布参数,本申请实施例不限定用于所采用的概率分布估计网络的网络结构和训练方法。例如,概率分布估计网络的网络结构可以为全连接网络或者CNN。另外,本申请实施例对概率分布估计网络的网络结构所包含的层数和每一层的节点数也不做限定。
以上介绍了编码端确定帧间运动特征以及编码帧间运动特征的实现过程。需要说明的是,对于帧间预测编码来说,编码端除了确定并编码帧间运动特征之外,还确定残差特征并编码残差特征,以便于解码端基于帧间运动特征和残差特征来解压视频。
也即是,参见图7,本申请实施例提供的编码方法还包括如下步骤305和步骤306。
步骤305:基于该帧间运动特征,确定残差特征。
在本申请实施例中,编码端基于该帧间运动特征来确定残差特征。
在一种实现方式中,编码端将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征。编码端基于当前图像与参考图像之间的重建运动特征,对参考特征进行变换,以得到当前图像的预测特征。编码端确定第一残差,第一残差为当前图像的预测特征与当前特征之间的残差,并将第一残差输入残差编码网络,以得到残差特征。也即 是,编码端在特征空间进行变换和预测。
在另一种实现方式中,编码端将帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征。编码端基于当前图像与参考图像之间的重建运动特征,对参考图像进行变换,以得到预测图像。编码端确定第二残差,第二残差为预测图像与当前图像之间的残差,并将第二残差输入残差编码网络,以得到残差特征。其中,预测图像是当前图像的预测图像。也即是,编码端在图像空间进行变换和预测。
需要说明的是,步骤305中的运动解码网络与步骤303中的运动解码网络为同一个。步骤305中的残差编码网络是预先训练得到的,本申请实施例中不限定残差编码网络的网络结构和训练方式等。例如,残差编码网络均可以是全连接网络或卷积神经网络,卷积神经网络中的卷积可以是2D卷积或3D卷积。另外,本申请实施例对残差编码网络所包括的网络层数和每一层的节点数也不作限定。可选地,残差编码网络的网络结构也如图5所示的网络结构。
步骤306:将该残差特征编入码流。
在本申请实施例中,编码端将该残差特征编入码流,以便于后续解码端基于码流中帧间运动特征与残差特征来解压视频。
可选地,编码端通过熵编码将该残差特征编入码流。在一种实现方式中,编码端根据指定的第三概率分布参数,通过熵编码将该残差特征编入码流。在另一种实现方式中,编码端将该残差特征输入超编码网络,以得到第二超先验特征。编码端根据指定的第四概率分布参数,通过熵编码将第二超先验特征编入码流。另外,编码端将第二超先验特征(从码流中解析出的第二超先验特征或者通过超编码网络得到的第二超先验特征)输入超解码网络,得到第二先验特征。编码端基于第二先验特征确定该残差特征的概率分布参数,基于该残差特征的概率分布参数,通过熵编码将该残差特征编入码流。需要说明的是,编码端将第二超先验特征编入码流是为了解码端基于第二超先验特征从码流中解析出残差特征。
其中,指定的第三概率分布参数和第四概率分布参数均为预先通过相应的概率分布估计网络确定的概率分布参数,本申请实施例不限定用于所采用的概率分布估计网络的网络结构和训练方法。例如,概率分布估计网络的网络结构可以为全连接网络或者CNN。本申请实施例对概率分布估计网络的网络结构所包含的层数和每一层的节点数也不做限定。另外,编码残差特征所采用的超编码网络与编码帧间运动特征所采用的超编码网络相同或不同,编码残差特征所采用的超解码网络与编码帧间运动特征所采用的超解码网络相同或不同。
需要说明的是,若上述任一概率分布估计网络使用高斯模型(如单高斯模型或混合高斯模型)来建模,则估计出的概率分布参数包括均值和方差。例如,假设待估计概率分布参数的残差特征符合单高斯模型或混合高斯模型,那么,该概率分布估计网络所得到的残差特征的概率分布参数包括均值和方差。若上述任一概率分布估计网络使用拉普拉斯分布模型来建模,则估计出的概率分布参数包括位置参数和尺度参数。若上述任一概率分布估计网络使用逻辑斯谛分布模型来建模,则估计出的概率分布参数包括均值和尺度参数。另外,本申请实施例中的概率分布估计网络也可称为因子熵模型,概率分布估计网络是熵估计网络的一部分,熵估计网络还包括上述超编码网络和超解码网络。例如,编码帧间运动特征所采用的超编码网络、概率分布估计网络和超解码网络组成一个熵估计网络的部分或全部,编码残差特征所采用的超编码网络、概率分布估计网络和超解码网络组成另一个熵估计网络的部分或全部。
图8是本申请实施例提供的一种熵估计网络的结构示意图。该熵估计网络可以为上述任 一熵估计网络。参见图8,该熵估计网络包括超编码(hyper encoder,HyEnc)网络、因子熵模型和超解码(hyper decoder,HyDec)网络。超编码网络包括三个卷积层(Conv)和穿插级联的两个激活层(如基于Relu或其他激活函数构建的激活层)。每个卷积层的卷积核大小均为5×5,输出的特征图的通道数为M,前两个卷积层对宽和高进行2倍下采样,最后一个卷积层不进行下采样。因子熵模型的网络结构如前述介绍的概率分布估计网络的网络结构。超编码网络包括三个卷积层(Conv)和穿插级联的两个激活层(如基于Relu或其他激活函数构建的激活层)。每个卷积层的卷积核大小均为5×5,输出的特征图的通道数为M,第一个卷积层不进行上采样,后两个卷积层对宽和高进行2倍上采样。需要说明的是,图8所示熵估计网络的结构并不用于限制本申请实施例,例如,卷积核大小、特征图的通道数、下采样倍数、下采样次数、上采样倍数、上采样次数、卷积层数等均可调整。
由前述步骤305和步骤306可知,在本申请实施例中,编码端先得到残差(第一残差或第二残差),再得到残差特征,进而编码残差特征,相当于对残差进行了压缩。在另一些实施例中,编码端也可以在得到残差后,直接将残差编入码流,也即是不压缩残差。
图9是本申请实施例提供的一种视频编解码方法的流程图。在图9中,假设待编码的当前图像xt为视频中的第t帧,t>0,当前图像xt的参考图像为参考帧的重建图像对当前图像xt进行编码的编码过程包括如下步骤901至步骤910。
步骤901:将当前图像xt和参考图像分别输入图像特征提取网络,以得到当前特征Ft和参考特征当前特征Ft和参考特征的维度均为c1*h1*w1。
步骤902:计算参考特征相对于当前特征Ft的相关性矩阵Ct。相关性矩阵Ct的维度为k*k*h1*w1。
步骤903:将相关性矩阵Ct、当前特征Ft和参考特征输入运动编码网络,以得到帧间运动特征该帧间运动特征为待编码的运动特征。可选地,帧间运动特征的维度为c1*h2*w2,通常h2<h1,w2<w1。
步骤904:通过熵估计网络确定帧间运动特征中各个元素对应的概率分布参数,如均值μm,t和方差σm,t
步骤905:基于帧间运动特征中各个元素对应的概率分布参数,通过熵编码将帧间运动特征编入码流。可选地,码流为比特流,对帧间运动特征进行熵编码所得到的比特序列是码流包括的部分比特序列,这部分比特序列称为运动信息码流或运动信息比特流。
步骤906:将帧间运动特征输入运动解码网络,以得到重建运动特征Mt。可选地,重建运动特征Mt的维度为c2*h1*w1,通常c2<c1。在一些实施例中,c2可能大于或等于c1。
步骤907:利用重建运动特征Mt,将参考特征变换为预测特征预测特征的维度为c1*h1*w1。
步骤908:计算当前特征Ft与预测特征之间的残差,将该残差输入残差编码网络,以得到残差特征该残差特征为待编码的残差特征。
步骤909:通过熵估计网络确定残差特征中各个元素对应的概率分布参数,如均值μr,t和方差σm,t
步骤910:基于残差特征中各个元素对应的概率分布参数,通过熵编码将残差特征编入码流。假设码流为比特流,对残差特征进行熵编码所得到的比特序列是码流包括的部分比特序列,这部分比特序列称为残差信息码流或运动信息比特流。
可选地,在编码当前图像之后,若后续的编码过程还需要将当前图像xt作为某待编码图像的参考图像,则编码端还将残差特征输入残差解码网络,以得到重建的残差。编码端基于预测特征和重建的残差,得到当前图像xt的重建特征,将当前图像xt的重建特征输入图像重建网络,以得到当前帧的重建图像
需要说明的是,图像重建网络在图9中未示出,图像重建网络可以是一种反卷积网络,图像重建网络可以与上述图像特征提取网络是相匹配的。另外,残差解码网络是预先训练得到的,本申请实施例中不限定残差解码网络的网络结构和训练方式等。例如,残差解码网络可以是全连接网络或卷积神经网络。另外,本申请实施例对残差解码网络所包括的网络层数和每一层的节点数也不作限定。残差解码网络的网络结构可以如图6所示的网络结构。
另外,结合图9中所示的虚线部分,可以将上述步骤903替换为:将相关性矩阵Ct、当前图像xt和参考图像输入运动编码网络,以得到帧间运动特征从而得到另一实施例。也可以将上述步骤907替换为:利用重建运动特征Mt,将参考图像变换为预测图像将步骤908替换为:计算当前图像xt与预测图像之间的残差,将该残差输入残差编码网络,以得到残差特征从而得到又一实施例,且在该实施例中,后续得到当前帧的重建图像的步骤与上段中所介绍的步骤的不同在于:基于预测图像和重建的残差,得到当前帧的重建图像
值得注意的是,在图9所示视频编解码方法的流程图中,未通过多次迭代来得到帧间运动特征。若想要细节更加丰富的重建运动特征,可以将上述步骤303中所介绍的迭代流程(如下述图10所示迭代流程)运用到图9所示的视频编解码方法中,从而通过多次迭代得到精度更高的帧间运动特征。
图10是本申请实施例提供的一种编码方法中的部分流程图。在图10中,假设迭代处理的总次数为K,初始值为0,迭代次数阈值为K-1,待编码的当前图像xt为视频中的第t帧,t>0,当前图像xt的参考图像为参考帧的重建图像对当前图像xt进行编码的编码过程包括如下步骤1001至步骤1010。
步骤1001:将当前图像xt和参考图像分别输入图像特征提取网络,以得到当前特征Ft和参考特征
步骤1002:计算参考特征相对于当前特征Ft的相关性矩阵Ct
步骤1003:将参考特征作为首次迭代处理的预测特征将相关性矩阵Ct作为预测特征相对于当前特征Ft的相关性矩阵。
步骤1004:将相关性矩阵Ct、预测特征和当前特征Ft输入运动编码网络,以得到运动特征并确定迭代次数i。其中,首次确定的迭代次数i为0,之后每次确定的迭代次数等于上一次确定的迭代次数加1。
步骤1005:判断i是否小于K-1。如果i小于K-1,则执行步骤1006。如果i=K-1,则将运动特征作为帧间运动特征,执行步骤1009。
步骤1006:将运动特征输入运动解码网络,以得到重建运动特征Mt
步骤1007:基于重建运动特征Mt,对参考特征进行变换,以重新预测特征
步骤1008:重新确定预测特征相对于当前特征Ft的相关性矩阵Ct,返回执行步骤1004。
步骤1009:通过熵估计网络确定帧间运动特征中各个元素对应的概率分布参数。
步骤1010:基于帧间运动特征中各个元素对应的概率分布参数,通过熵编码将帧间运 动特征编入码流。
步骤1011:将帧间运动特征输入运动解码网络,以得到重建运动特征Mt
步骤1012:利用重建运动特征Mt,将参考特征变换为预测特征
步骤1013:计算当前特征Ft与预测特征之间的残差,将该残差输入残差编码网络,以得到残差特征
步骤1014:通过熵估计网络确定残差特征中各个元素对应的概率分布参数。
步骤1015:基于残差特征中各个元素对应的概率分布参数,通过熵编码将残差特征编入码流。
图11是本申请实施例提供的另一种重建运动特征的对比图。图11是关于某视频中同一图像的重建运动特征,第一行是上述图9所示的视频编解码方法中所得到的重建运动特征,第二行是相关技术中所得到的重建运动特征,该相关技术是将当前帧和参考帧的图像特征输入卷积神经网络,完全依赖卷积运算来拟合当前帧与参考帧之间的运动。可以看出,第一行的边缘结构更加清晰明确、精度更高,换句话说,第一行的重建运动特征相比于第二行的重建运动特征具有更明显的细节信息。
由前述可知,本方案既能够应用于P帧,又能够应用于B帧,前面所介绍的编码过程是以P帧为例。对于B帧来说,当前图像对应有两个参考图像,在一实现方式中,编码端基于这两个参考图像,按照前述方法得到两个帧间运动特征和两个残差特征,将这两个帧间运动特征和这两个残差特征编入码流。其中,这两个帧间运动特征分别对应这两个参考图像,这两个残差特征也分别对应这两个参考图像。在另一实现方式中,编码端基于这两个参考图像,按照前述方法得到两个帧间运动特征,将这两个帧间运动特征编入码流。编码端基于这两个帧间运动特征,对参考特征进行变换,以分别得到两个预测特征,融合这两个预测特征,得到一个融合预测特征,基于融合预测特征和当前特征,得到一个残差特征,将该残差特征编入码流。
为了验证本方案的编解码性能,本申请实施例还在测试集上对本方案以及两个对比方案进行了测试。该测试集包括联合视频专家组(joint video experts team,JVET)标准测试集中的B类、C类和D类视频,以及YUV_CTC视频集中的一些视频。其中,B类视频的分辨率为1920*1080,C类视频的分辨率为832*480,D类视频的分辨率为416*240。图12至图15是本申请实施例提供的编解码性能对比图。其中,图12至图14分别是针对B类、C类和D类视频的测试结果,图15是针对YUV_CTC视频集中视频的测试结果。图12至图15的图例中“Corr”代表本方案,即基于相关性矩阵的帧间预测编码方案,图例中的“光流”代表对比方案1,即基于光流估计的帧间预测编码方案,图例中的“FVC”代表对比方案2,即仅基于当前帧和参考帧的特征得到帧间运动特征的编码方案。由于性能指标PSNR越高,重建的图像质量越好,BPP越大,压缩率越低,重建的图像质量越低。可以看出,本方案所对应曲线更靠上和靠左,说明本方案的编解码性能相对更好,即压缩性能更优。
由前述可知,本方案不依赖于光流,减少了计算光流的计算量,且帧间运动特征的预测更加容易,且更容易预测出更有利于压缩的帧间运动特征。本方案引入相关性矩阵,有利于提高帧间运动特征的预测精度,使得帧间预测和压缩变得更加简单,且利于提升整个编解码模型的拟合和泛化能力。
综上所述,在本申请实施例中,通过引入相关性矩阵来拟合帧间运动,由于相关性矩阵 能够表征当前特征与参考特征之间相关性较强和较弱的部分,相关性较强的部分所对应的帧间运动信息更加丰富,因此,在拟合帧间运动的过程中,基于相关性矩阵中各个元素的大小,便能够更好地拟合出相关性较强的部分对应的帧间运动,而较少地关注相关性较弱的部分对应的帧间运动。简单来说,相关性矩阵对帧间运动特征的预测有信息增强的作用,即,能够提高帧间运动特征的预测精度,进而提升压缩性能。
图16是本申请实施例提供的一种解码方法的流程图。该方法应用于解码端。参见图16,该方法包括如下步骤。
步骤1601:从码流中解析出帧间运动特征和残差特征。
其中,编入码流的帧间运动特征是基于参考特征相对于当前特征的相关性矩阵确定的。
可选地,解码端通过熵解码从码流中解析出该帧间运动特征。在一种实现方式中,解码端根据指定的第一概率分布参数,通过熵解码从码流中解析出该帧间运动特征。在另一种实现方式中,解码端根据指定的第二概率分布参数,从码流中解析出第一超先验特征,将第一超先验特征输入超解码网络,得到第一先验特征。解码端基于第一先验特征确定该帧间运动特征的概率分布参数,基于该帧间运动特征的概率分布参数,从码流中解析出该帧间运动特征。
可选地,解码端通过熵解码从码流中解析出该残差特征。在一种实现方式中,解码端根据指定的第三概率分布参数,通过熵解码从码流中解析出该残差特征。在另一种实现方式中,解码端根据指定的第四概率分布参数,从码流中解析出第二超先验特征,将第二超先验特征输入超解码网络,得到第二先验特征。解码端基于第二先验特征确定该残差特征的概率分布参数,基于该残差特征的概率分布参数,通过熵解码从码流中解析出该残差特征。
需要说明的是,解码过程中的第一概率分布参数、第二概率分布参数、第三概率分布参数和第四概率分布参数与编码过程中的相同,解码过程中所采用的超解码网络与编码过程中的超解码网络相同。
步骤1602:基于该帧间运动特征和参考特征,确定待解码的当前图像的预测特征。
在本申请实施例中,解码端将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征。解码端基于该重建运动特征,对参考特征进行变换,以得到当前图像的预测特征。具有实现过程与编码过程中的相关内容一致,这里不再赘述。
步骤1603:基于该预测特征和残差特征,重建出当前图像。
在本申请实施例中,解码端基于该残差特征和预测特征,得到当前图像的重建特征。解码端将当前图像的重建特征输入图像重建网络,以重建出当前图像,即得到当前帧的重建图像。其中,解码端可以将该残差特征输入残差解码网络,以得到重建的残差,基于该预测特征和重建的残差,得到当前图像的重建特征。具体实现过程与上述图9实施例中的相关内容一致,这里不再赘述。
在另一实施例的解码过程中,将上述步骤1602和步骤1603替换为:基于该帧间运动特征和参考图像,确定待解码的当前图像的预测图像,基于该预测图像和残差特征,重建出当前特征。其中,解码端可以将该帧间运动特征输入运动解码网络,以得到重建运动特征。解码端基于该重建运动特征,对参考图像进行变换,以得到当前图像的预测图像。解码端可以将该残差特征输入残差解码网络,以得到重建的残差,基于该预测图像和重建的残差,重建 出当前图像。
在又一实施例的解码过程中,解码端从码流中解析出帧间运动特征和残差,将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征。解码端基于该重建运动特征,对参考特征进行变换,以得到当前图像的预测特征。解码端基于该预测特征和解析出的残差,得到当前图像的重建特征,将当前图像的重建特征输入图像重建网络,以重建出当前图像。
在又一实施例的解码过程中,解码端从码流中解析出帧间运动特征和残差,将该帧间运动特征输入运动解码网络,以得到当前图像与参考图像之间的重建运动特征。解码端基于该重建运动特征,对参考图像进行变换,以得到当前图像的预测图像。解码端基于该预测图像和解析出的残差,重建出当前图像。
需要说明的是,由前述可知,本方案既能够应用于P帧,又能够应用于B帧。对于P帧来说,解码端从码流中解析出一个帧间运动特征和一个残差特征,从而解码P帧。对于B帧来说,当前图像对应有两个参考图像,在一实现方式中,解码端从码流中解析出两个帧间运动特征和两个残差特征,解码端基于这两个帧间运动特征得到两个预测特征,基于这两个残差特征和这两个预测特征,得到当前帧的两个重建图像,融合这两个重建图像,重建出当前图像。在另一实现方式中,解码端从码流中解析出两个帧间运动特征和一个残差特征,解码端基于这两个帧间运动特征,得到两个预测特征,融合这两个预测特征,得到一个融合预测特征,基于融合预测特征和残差特征,重建出当前图像。
还需要说明的是,上述任一实施例中的解码过程与编码过程是相匹配的,例如,若编码过程是在图像空间作变换和预测,那么解码过程也是在图像空间作变换和预测。若编码过程是在特征空间作变换和预测,那么解码过程也是在特征空间作变换和预测。
综上所述,在本申请实施例的编码过程中,通过引入相关性矩阵来拟合帧间运动。相关性矩阵对帧间运动特征的预测有信息增强的作用,即,能够提高帧间运动特征的预测精度,进而提升压缩性能。
图17是本申请实施例提供的一种编码装置1700的结构示意图,该编码装置1700可以由软件、硬件或者两者的结合实现成为计算机设备的部分或者全部,该计算机设备可以包括上述实施例中的任一编码端。参见图17,该装置1700包括:第一确定模块1701、第二确定模块1702、第三确定模块1703和第一编码模块1704。
第一确定模块1701,用于确定当前特征和参考特征,当前特征为待编码的当前图像的特征,参考特征为当前图像的参考图像的特征;
第二确定模块1702,用于确定参考特征相对于当前特征的相关性矩阵;
第三确定模块1703,用于基于该相关性矩阵,确定帧间运动特征;
第一编码模块1704,用于将该帧间运动特征编入码流。
可选地,第三确定模块1703用于:
将该相关性矩阵输入运动编码网络,以得到帧间运动特征;或者,
将该相关性矩阵、当前特征和参考特征输入运动编码网络,以得到帧间运动特征;或者,
将该相关性矩阵、当前图像和参考图像输入运动编码网络,以得到帧间运动特征。
可选地,第三确定模块1703用于:
将参考特征作为预测特征,将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征;
确定迭代次数;
如果迭代次数小于迭代次数阈值,则将该运动特征输入运动解码网络,以得到重建运动特征,基于该重建运动特征,对参考特征进行变换,以重新确定预测特征,重新确定预测特征相对于当前特征的相关性矩阵,返回执行将该相关性矩阵、预测特征和当前特征输入运动编码网络,以得到运动特征的步骤;
如果迭代次数等于迭代次数阈值,则将该运动特征确定为帧间运动特征。
可选地,该装置1700还包括:
第四确定模块,用于基于该帧间运动特征,确定残差特征;
第二编码模块,用于将该残差特征编入码流。
Optionally, the fourth determination module is configured to:
input the inter-frame motion feature into a motion decoding network to obtain a reconstructed motion feature between the current image and the reference image;
transform the reference feature based on the reconstructed motion feature between the current image and the reference image to obtain a prediction feature of the current image;
determine a first residual, the first residual being the residual between the prediction feature of the current image and the current feature;
input the first residual into a residual encoding network to obtain the residual feature.
Optionally, the fourth determination module is configured to:
input the inter-frame motion feature into a motion decoding network to obtain a reconstructed motion feature between the current image and the reference image;
transform the reference image based on the reconstructed motion feature between the current image and the reference image to obtain a prediction image;
determine a second residual, the second residual being the residual between the prediction image and the current image;
input the second residual into a residual encoding network to obtain the residual feature. A sketch of this image-domain residual path is given after this list.
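The sketch below illustrates this image-domain residual path; the feature-domain path of the preceding alternative differs only in that the reference feature is warped and the current feature is subtracted. The network modules and the warp operator are placeholders assumed for illustration.

```python
def derive_residual_feature_image_domain(motion_feat, cur_img, ref_img,
                                         motion_decoder, residual_encoder, warp):
    """Sketch of the image-domain residual path (second residual) described above."""
    recon_motion = motion_decoder(motion_feat)   # reconstructed motion between current and reference
    pred_img = warp(ref_img, recon_motion)       # transform the reference image -> prediction image
    second_residual = cur_img - pred_img         # residual between prediction image and current image
    return residual_encoder(second_residual)     # residual feature to be written to the bitstream
```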
Optionally, the reference image is a reconstructed image of a reference frame.
In this embodiment of the application, inter-frame motion is fitted by introducing a correlation matrix. Since the correlation matrix can characterize which parts of the current feature and the reference feature are strongly correlated and which are weakly correlated, and the strongly correlated parts carry richer inter-frame motion information, the magnitudes of the elements of the correlation matrix allow the motion corresponding to the strongly correlated parts to be fitted better during the fitting of inter-frame motion, while less attention is paid to the motion corresponding to the weakly correlated parts. In short, the correlation matrix acts as an information-enhancement signal for the prediction of the inter-frame motion features, i.e., it improves the prediction accuracy of the inter-frame motion features and thereby improves compression performance.
It should be noted that when the encoding apparatus provided by the above embodiment performs video encoding, the division into the above functional modules is used only as an example; in practical applications, the above functions may be assigned to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the encoding apparatus provided by the above embodiment and the encoding method embodiments belong to the same concept; for the specific implementation, refer to the method embodiments, which is not repeated here.
Fig. 18 is a schematic block diagram of a coding apparatus 1800 used in an embodiment of this application. The coding apparatus 1800 may include a processor 1801, a memory 1802 and a bus system 1803. The processor 1801 and the memory 1802 are connected through the bus system 1803; the memory 1802 is configured to store instructions, and the processor 1801 is configured to execute the instructions stored in the memory 1802 to perform the various encoding or decoding methods described in the embodiments of this application. To avoid repetition, they are not described in detail here.
In this embodiment of the application, the processor 1801 may be a central processing unit (CPU), or may be another general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 1802 may include a ROM device or a RAM device. Any other suitable type of storage device may also be used as the memory 1802. The memory 1802 may include code and data 18021 accessed by the processor 1801 via the bus 1803. The memory 1802 may further include an operating system 18023 and application programs 18022, where the application programs 18022 include at least one program that allows the processor 1801 to perform the encoding or decoding methods described in the embodiments of this application. For example, the application programs 18022 may include applications 1 to N, which further include an encoding or decoding application (codec application for short) that performs the encoding or decoding methods described in the embodiments of this application.
In addition to a data bus, the bus system 1803 may further include a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all labeled as the bus system 1803 in the figure.
Optionally, the coding apparatus 1800 may further include one or more output devices, such as a display 1804. In one example, the display 1804 may be a touch-sensitive display that combines a display with a touch-sensing unit operable to sense touch input. The display 1804 may be connected to the processor 1801 via the bus 1803.
It should be noted that the coding apparatus 1800 may perform the encoding method in the embodiments of this application, and may also perform the decoding method in the embodiments of this application.
Those skilled in the art will appreciate that the functions described in connection with the various illustrative logical blocks, modules and algorithm steps disclosed herein may be implemented in hardware, software, firmware or any combination thereof. If implemented in software, the functions described by the various illustrative logical blocks, modules and steps may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium, which corresponds to a tangible medium such as a data storage medium, or a communication medium including any medium that facilitates transfer of a computer program from one place to another (for example, according to a communication protocol). In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this application. A computer program product may include a computer-readable medium.
By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server or other remote source using a coaxial cable, optical fiber cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio and microwave, then the coaxial cable, optical fiber cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals or other transitory media, but are instead directed to non-transitory tangible storage media. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, DVD and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functions described by the various illustrative logical blocks, modules and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Moreover, the techniques may be fully implemented in one or more circuits or logic elements. In one example, the various illustrative logical blocks, units and modules in the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logic elements.
The techniques of the embodiments of this application may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chipset). Various components, modules or units are described in the embodiments of this application to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or provided by interoperating hardware units (including one or more processors as described above).
All or part of the above embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital versatile disc (DVD)), a semiconductor medium (e.g., solid state disk (SSD)), or the like. It should be noted that the computer-readable storage medium mentioned in the embodiments of this application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that "at least one" mentioned herein means one or more, and "a plurality of" means two or more. In the description of the embodiments of this application, unless otherwise stated, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, to clearly describe the technical solutions of the embodiments of this application, words such as "first" and "second" are used in the embodiments of this application to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will understand that the words "first", "second" and the like do not limit the quantity or execution order, and do not necessarily indicate a difference.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the videos, images and so on involved in the embodiments of this application are all obtained with full authorization.
The above are only exemplary embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of this application shall be included within the protection scope of this application.

Claims (17)

  1. An encoding method, characterized in that the method comprises:
    determining a current feature and a reference feature, wherein the current feature is a feature of a current image to be encoded and the reference feature is a feature of a reference image of the current image;
    determining a correlation matrix of the reference feature relative to the current feature;
    determining an inter-frame motion feature based on the correlation matrix; and
    encoding the inter-frame motion feature into a bitstream.
  2. The method according to claim 1, characterized in that the determining an inter-frame motion feature based on the correlation matrix comprises:
    inputting the correlation matrix into a motion encoding network to obtain the inter-frame motion feature; or
    inputting the correlation matrix, the current feature and the reference feature into the motion encoding network to obtain the inter-frame motion feature; or
    inputting the correlation matrix, the current image and the reference image into the motion encoding network to obtain the inter-frame motion feature.
  3. The method according to claim 1, characterized in that the determining an inter-frame motion feature based on the correlation matrix comprises:
    taking the reference feature as a prediction feature, and inputting the correlation matrix, the prediction feature and the current feature into a motion encoding network to obtain a motion feature;
    determining an iteration count;
    if the iteration count is less than an iteration-count threshold, inputting the motion feature into a motion decoding network to obtain a reconstructed motion feature, transforming the reference feature based on the reconstructed motion feature to re-determine the prediction feature, re-determining the correlation matrix of the prediction feature relative to the current feature, and returning to the step of inputting the correlation matrix, the prediction feature and the current feature into the motion encoding network to obtain a motion feature; and
    if the iteration count is equal to the iteration-count threshold, determining the motion feature as the inter-frame motion feature.
  4. The method according to any one of claims 1 to 3, characterized in that after the determining an inter-frame motion feature based on the correlation matrix, the method further comprises:
    determining a residual feature based on the inter-frame motion feature; and
    encoding the residual feature into the bitstream.
  5. The method according to claim 4, characterized in that the determining a residual feature based on the inter-frame motion feature comprises:
    inputting the inter-frame motion feature into a motion decoding network to obtain a reconstructed motion feature between the current image and the reference image;
    transforming the reference feature based on the reconstructed motion feature between the current image and the reference image to obtain a prediction feature of the current image;
    determining a first residual, wherein the first residual is a residual between the prediction feature of the current image and the current feature; and
    inputting the first residual into a residual encoding network to obtain the residual feature.
  6. The method according to claim 4, characterized in that the determining a residual feature based on the inter-frame motion feature comprises:
    inputting the inter-frame motion feature into a motion decoding network to obtain a reconstructed motion feature between the current image and the reference image;
    transforming the reference image based on the reconstructed motion feature between the current image and the reference image to obtain a prediction image;
    determining a second residual, wherein the second residual is a residual between the prediction image and the current image; and
    inputting the second residual into a residual encoding network to obtain the residual feature.
  7. The method according to any one of claims 1 to 6, characterized in that the reference image is a reconstructed image of a reference frame.
  8. An encoding apparatus, characterized in that the apparatus comprises:
    a first determination module, configured to determine a current feature and a reference feature, wherein the current feature is a feature of a current image to be encoded and the reference feature is a feature of a reference image of the current image;
    a second determination module, configured to determine a correlation matrix of the reference feature relative to the current feature;
    a third determination module, configured to determine an inter-frame motion feature based on the correlation matrix; and
    a first encoding module, configured to encode the inter-frame motion feature into a bitstream.
  9. The apparatus according to claim 8, characterized in that the third determination module is configured to:
    input the correlation matrix into a motion encoding network to obtain the inter-frame motion feature; or
    input the correlation matrix, the current feature and the reference feature into the motion encoding network to obtain the inter-frame motion feature; or
    input the correlation matrix, the current image and the reference image into the motion encoding network to obtain the inter-frame motion feature.
  10. The apparatus according to claim 8, characterized in that the third determination module is configured to:
    take the reference feature as a prediction feature, and input the correlation matrix, the prediction feature and the current feature into a motion encoding network to obtain a motion feature;
    determine an iteration count;
    if the iteration count is less than an iteration-count threshold, input the motion feature into a motion decoding network to obtain a reconstructed motion feature, transform the reference feature based on the reconstructed motion feature to re-determine the prediction feature, re-determine the correlation matrix of the prediction feature relative to the current feature, and return to the step of inputting the correlation matrix, the prediction feature and the current feature into the motion encoding network to obtain a motion feature; and
    if the iteration count is equal to the iteration-count threshold, determine the motion feature as the inter-frame motion feature.
  11. The apparatus according to any one of claims 8 to 10, characterized in that the apparatus further comprises:
    a fourth determination module, configured to determine a residual feature based on the inter-frame motion feature; and
    a second encoding module, configured to encode the residual feature into the bitstream.
  12. The apparatus according to claim 11, characterized in that the fourth determination module is configured to:
    input the inter-frame motion feature into a motion decoding network to obtain a reconstructed motion feature between the current image and the reference image;
    transform the reference feature based on the reconstructed motion feature between the current image and the reference image to obtain a prediction feature of the current image;
    determine a first residual, wherein the first residual is a residual between the prediction feature of the current image and the current feature; and
    input the first residual into a residual encoding network to obtain the residual feature.
  13. The apparatus according to claim 11, characterized in that the fourth determination module is configured to:
    input the inter-frame motion feature into a motion decoding network to obtain a reconstructed motion feature between the current image and the reference image;
    transform the reference image based on the reconstructed motion feature between the current image and the reference image to obtain a prediction image;
    determine a second residual, wherein the second residual is a residual between the prediction image and the current image; and
    input the second residual into a residual encoding network to obtain the residual feature.
  14. The apparatus according to any one of claims 8 to 13, characterized in that the reference image is a reconstructed image of a reference frame.
  15. An encoding apparatus, characterized by comprising a processor and an interface circuit, wherein the processor receives and/or sends data through the interface circuit, and the processor is configured to invoke program instructions stored in a memory to perform the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, and when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.
  17. A computer program product, characterized in that computer instructions are stored in the computer program product, and when the computer instructions are executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.
PCT/CN2023/076925 2022-03-31 2023-02-17 编码方法、装置、存储介质及计算机程序产品 WO2023185305A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210345172.7A CN116962715A (zh) 2022-03-31 2022-03-31 编码方法、装置、存储介质及计算机程序产品
CN202210345172.7 2022-03-31

Publications (1)

Publication Number Publication Date
WO2023185305A1 true WO2023185305A1 (zh) 2023-10-05

Family

ID=88199134

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/076925 WO2023185305A1 (zh) 2022-03-31 2023-02-17 编码方法、装置、存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN116962715A (zh)
WO (1) WO2023185305A1 (zh)

Also Published As

Publication number Publication date
CN116962715A (zh) 2023-10-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777684

Country of ref document: EP

Kind code of ref document: A1